[jira] Commented: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat

Enis Soztutar (JIRA) Thu, 07 Feb 2008 09:12:29 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566693#action_12566693
 ]


Enis Soztutar commented on HADOOP-449:
--------------------------------------

Thanks very much for the review.

bq. It looks like one can't meaningfully nest FilterInputFormats. A check in 
initBaseInputFormat that rejects an attempt to do this would probably be a good 
idea.

In my opinion, people are more likely to nest WritableFilters than 
FilterInputFormats for multi-criteria filtering (for example, to filter by range 
and by percent). However, I do not expect chaining of more than one filter to be 
common (and one can always write a ChainFilter that is configured to apply 
several filters sequentially). 
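To make the ChainFilter idea concrete, here is a minimal sketch. It assumes the WritableFilter interface from the issue description, but uses Object in place of Writable so that it compiles without Hadoop; ChainFilter and ChainFilterDemo are hypothetical names and are not part of the attached patch.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for the WritableFilter proposed in this issue. The real interface
// would take an org.apache.hadoop.io.Writable; Object is an assumption made
// only to keep the sketch self-contained.
interface WritableFilter {
    boolean accept(Object item);
}

// Hypothetical ChainFilter: an item is accepted only if every delegate
// filter accepts it, evaluated in the order given.
class ChainFilter implements WritableFilter {
    private final List<WritableFilter> filters;

    ChainFilter(WritableFilter... filters) {
        this.filters = Arrays.asList(filters);
    }

    @Override
    public boolean accept(Object item) {
        for (WritableFilter filter : filters) {
            if (!filter.accept(item)) {
                return false; // stop at the first rejection
            }
        }
        return true;
    }
}

class ChainFilterDemo {
    public static void main(String[] args) {
        WritableFilter inRange = item -> ((Integer) item) >= 10 && ((Integer) item) <= 20;
        WritableFilter even = item -> ((Integer) item) % 2 == 0;
        WritableFilter chain = new ChainFilter(inRange, even);

        System.out.println(chain.accept(12)); // in range and even
        System.out.println(chain.accept(13)); // in range but odd
        System.out.println(chain.accept(30)); // even but out of range
    }
}
```

With a filter like this, a single FilterInputFormat could cover the range-and-percent case without nesting input formats.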

I think people will understand that nesting FilterInputFormats cannot be done 
with the current API:
{noformat}

job.setInputFormat(FilterInputFormat.class);
FilterInputFormat.setBaseInputFormat(job, FilterInputFormat.class); // nesting
// now we should set the actual InputFormat
FilterInputFormat.setBaseInputFormat(job, TextInputFormat.class); // already confusing
{noformat}
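If we do add the check suggested in the review, it could look roughly like the sketch below. The types here are empty stand-ins for the Hadoop classes, and checkBaseInputFormat is a hypothetical helper name; the idea is only that initBaseInputFormat would reject a base class that is itself a FilterInputFormat.

```java
// Empty stand-ins for the Hadoop types; only the class hierarchy matters here.
interface InputFormat {}

class TextInputFormat implements InputFormat {}

class FilterInputFormat implements InputFormat {
    // Hypothetical validation that initBaseInputFormat could call before
    // instantiating the base format.
    static void checkBaseInputFormat(Class<? extends InputFormat> base) {
        if (FilterInputFormat.class.isAssignableFrom(base)) {
            throw new IllegalArgumentException(
                "FilterInputFormat cannot be nested: " + base.getName());
        }
    }
}

class NestingCheckDemo {
    public static void main(String[] args) {
        FilterInputFormat.checkBaseInputFormat(TextInputFormat.class); // accepted
        try {
            FilterInputFormat.checkBaseInputFormat(FilterInputFormat.class);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Using isAssignableFrom rather than an equality test also rejects subclasses of FilterInputFormat, which would have the same problem.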

bq. IIRC, Configuration::IntegerRanges is limited to positive integers, so the 
default range of (Integer.MIN_VALUE + "-" + Integer.MAX_VALUE) in 
IntRangeFilter::setConf may be invalid.
Right, I missed that IntegerRanges is limited to positive integers. I think we 
should make IntRangeFilter extend ComparableRangeFilter instead (and no longer 
use IntegerRanges). 
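A rough sketch of that direction follows. The constructor signatures, the inclusive bounds, and the use of Object instead of Writable are all assumptions made for illustration; the actual ComparableRangeFilter in the patch may look different.

```java
// Stand-in for the WritableFilter proposed in this issue (Object instead of
// Writable is an assumption, to avoid a Hadoop dependency).
interface WritableFilter {
    boolean accept(Object item);
}

// Hypothetical shape of a range filter over any Comparable type.
// Both bounds are treated as inclusive, which is an assumption.
class ComparableRangeFilter<T extends Comparable<T>> implements WritableFilter {
    private final Class<T> type;
    private final T min;
    private final T max;

    ComparableRangeFilter(Class<T> type, T min, T max) {
        this.type = type;
        this.min = min;
        this.max = max;
    }

    @Override
    public boolean accept(Object item) {
        if (!type.isInstance(item)) {
            return false; // items of another type never match
        }
        T value = type.cast(item);
        return min.compareTo(value) <= 0 && value.compareTo(max) <= 0;
    }
}

// The full int range, including negative values, becomes a valid default
// because no IntegerRanges parsing is involved.
class IntRangeFilter extends ComparableRangeFilter<Integer> {
    IntRangeFilter() {
        this(Integer.MIN_VALUE, Integer.MAX_VALUE);
    }

    IntRangeFilter(int min, int max) {
        super(Integer.class, min, max);
    }
}

class IntRangeFilterDemo {
    public static void main(String[] args) {
        System.out.println(new IntRangeFilter().accept(-5));
        System.out.println(new IntRangeFilter(0, 10).accept(11));
    }
}
```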

bq. RangeFilter should probably accept a WritableComparator to support user 
types and an alternative syntax for projections
Please see below.
bq. Would it be too constraining to limit SetFilter and ItemFilter to Text? 
Filtering based on the string representation of Writables seems like an overly 
general strategy.
The main reason for this odd strategy of using string comparison on the 
serialized versions of the Writables is that we must somehow pass the specified 
Writables (for example, the min and max values) to the tasks, and currently the 
only way to do this is to store them in the configuration. It would be great to 
have setWritable() and getWritable() methods in the configuration, so that we 
could directly compare WritableComparables (possibly using 
WritableComparators); however, the proposal to add these methods was rejected 
(I cannot remember the issue number). 
Another solution may be to add an interface, for example WritableStorable or 
WritableDeserializer, that provides a Writable forName(String) method. 

If you see a better way to pass the Writables to the tasks, I will be very 
glad to adopt it. Or should we add setWritable() and getWritable() to the 
Configuration? 
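To illustrate what setWritable() and getWritable() might do, here is a sketch that serializes a Writable to bytes and stores it Base64-encoded as an ordinary string property. Everything in it is a stand-in: Writable mirrors the write/readFields shape of org.apache.hadoop.io.Writable, IntWritable is a toy value type, StringConf is a hypothetical string-only configuration, and Base64 is just one possible encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Stand-in with the same two methods as org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Toy Writable used only for this example.
class IntWritable implements Writable {
    int value;

    IntWritable() {}

    IntWritable(int value) { this.value = value; }

    @Override
    public void write(DataOutput out) throws IOException { out.writeInt(value); }

    @Override
    public void readFields(DataInput in) throws IOException { value = in.readInt(); }
}

// Hypothetical string-only configuration with the two proposed methods.
class StringConf {
    private final Map<String, String> props = new HashMap<>();

    void setWritable(String key, Writable writable) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writable.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        props.put(key, Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    // Fills the supplied instance from the stored bytes; returns null if the
    // key is not set.
    <T extends Writable> T getWritable(String key, T reuse) {
        String encoded = props.get(key);
        if (encoded == null) {
            return null;
        }
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            reuse.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reuse;
    }
}

class WritableConfDemo {
    public static void main(String[] args) {
        StringConf conf = new StringConf();
        conf.setWritable("filter.min", new IntWritable(-42));
        IntWritable min = conf.getWritable("filter.min", new IntWritable());
        System.out.println(min.value);
    }
}
```

With the values restored as real objects on the task side, the filters could compare them directly instead of comparing their string forms.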




> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFilter that was introduced in 
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I 
> propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement 
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an 
> internal RecordReader rather than the SequenceFile. This will require adding 
> a next(key) and getCurrentValue(value) to the RecordReader interface, but 
> that will be addressed in a different issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
