[
https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566693#action_12566693
]
Enis Soztutar commented on HADOOP-449:
--------------------------------------
Thanks very much for the review.
bq. It looks like one can't meaningfully nest FilterInputFormats. A check in
initBaseInputFormat that rejects an attempt to do this would probably be a good
idea.
In my opinion, people are more likely to nest WritableFilters than
FilterInputFormats for multi-criteria filtering (for example, to filter by range
and by percent). However, I do not expect chaining of more than one filter to be
used much (one can always write a ChainFilter that is configured to apply
several filters sequentially).
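A minimal sketch of what such a ChainFilter could look like. The nested Filter
interface is a stand-in for the proposed WritableFilter (it takes Object so the
example compiles without Hadoop), and the class names are hypothetical, not
taken from the patch.

```java
import java.util.Arrays;
import java.util.List;

class ChainFilterSketch {
    /** Stand-in for the proposed WritableFilter; not the real Hadoop type. */
    interface Filter {
        boolean accept(Object item);
    }

    /** Hypothetical ChainFilter: accepts an item only if every delegate accepts it. */
    static class ChainFilter implements Filter {
        private final List<Filter> filters;

        ChainFilter(Filter... filters) {
            this.filters = Arrays.asList(filters);
        }

        @Override
        public boolean accept(Object item) {
            // Apply the filters in order and stop at the first rejection.
            for (Filter f : filters) {
                if (!f.accept(item)) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Filter inRange = item -> ((Integer) item) >= 10 && ((Integer) item) <= 20;
        Filter even = item -> ((Integer) item) % 2 == 0;
        Filter chain = new ChainFilter(inRange, even);

        System.out.println(chain.accept(12)); // true: in range and even
        System.out.println(chain.accept(13)); // false: rejected by the second filter
        System.out.println(chain.accept(42)); // false: rejected by the first filter
    }
}
```

With this shape, only one filter class needs to be configured on the job, and
the chain decides internally which filters to apply and in what order.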
I think people will understand that FilterInputFormats cannot be nested with
the current API:
{noformat}
job.setInputFormat(FilterInputFormat.class);
FilterInputFormat.setBaseInputFormat(job, FilterInputFormat.class); //nesting
//now we should set the actual InputFormat
FilterInputFormat.setBaseInputFormat(job, TextInputFormat.class); //already confusing
{noformat}
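For reference, the check suggested in the review could be sketched roughly as
follows. This is only an illustration under assumptions: the conf is a plain
Map rather than a JobConf, the InputFormat classes are empty stand-ins, the
check sits in setBaseInputFormat rather than initBaseInputFormat, and the key
is the one named in the issue description, which the patch may not use.

```java
import java.util.HashMap;
import java.util.Map;

class NestingCheckSketch {
    /** Empty stand-ins for the real InputFormat types. */
    interface InputFormat {}
    static class TextInputFormat implements InputFormat {}

    static class FilterInputFormat implements InputFormat {
        // Key from the issue description; assumed here, not confirmed by the patch.
        static final String BASE_KEY = "mapred.input.filter.source";

        static void setBaseInputFormat(Map<String, String> conf,
                                       Class<? extends InputFormat> base) {
            // Reject FilterInputFormat (or a subclass) as its own base format.
            if (FilterInputFormat.class.isAssignableFrom(base)) {
                throw new IllegalArgumentException(
                    "FilterInputFormat cannot wrap another FilterInputFormat");
            }
            conf.put(BASE_KEY, base.getName());
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        FilterInputFormat.setBaseInputFormat(conf, TextInputFormat.class);
        System.out.println("base format set to " + conf.get(FilterInputFormat.BASE_KEY));

        try {
            FilterInputFormat.setBaseInputFormat(conf, FilterInputFormat.class);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```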
bq. IIRC, Configuration::IntegerRanges is limited to positive integers, so the
default range of (Integer.MIN_VALUE + "-" + Integer.MAX_VALUE) in
IntRangeFilter::setConf may be invalid.
Right, I missed that IntegerRanges is limited to positive integers. I think we
should make IntRangeFilter extend ComparableRangeFilter instead (and stop using
IntegerRanges).
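To illustrate why a Comparable-based range avoids the sign restriction, here is
a small sketch. The ComparableRange class is hypothetical and is not the
ComparableRangeFilter from the patch; inclusive bounds are an assumption.

```java
class RangeFilterSketch {
    /** Hypothetical range check over any Comparable type, with inclusive bounds. */
    static class ComparableRange<T extends Comparable<T>> {
        private final T min;
        private final T max;

        ComparableRange(T min, T max) {
            this.min = min;
            this.max = max;
        }

        boolean accept(T item) {
            // compareTo handles negative numbers, so no special range syntax is needed.
            return min.compareTo(item) <= 0 && item.compareTo(max) <= 0;
        }
    }

    public static void main(String[] args) {
        // The default "everything" range that IntegerRanges could not express.
        ComparableRange<Integer> all =
            new ComparableRange<>(Integer.MIN_VALUE, Integer.MAX_VALUE);
        System.out.println(all.accept(-5)); // true

        ComparableRange<Integer> small = new ComparableRange<>(-10, 10);
        System.out.println(small.accept(11)); // false
    }
}
```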
bq. RangeFilter should probably accept a WritableComparator to support user
types and an alternative syntax for projections
Please see below.
bq. Would it be too constraining to limit SetFilter and ItemFilter to Text?
Filtering based on the string representation of Writables seems like an overly
general strategy.
The main reason behind this odd strategy of comparing the string forms of
serialized Writables is that we must somehow pass the specified Writables (for
example, the min and max values) to the tasks, and currently the only way to do
this is to store them in the configuration. It would be great if we had
setWritable() and getWritable() methods in the configuration, so that we could
directly compare WritableComparables (possibly using WritableComparators);
however, the proposal to add these methods was lazily rejected (I cannot
remember the issue number).
Another solution may be to add an interface, for example WritableStorable or
WritableDeserializer, which would provide a
Writable forName(String) method.
If you see a better way to pass the Writables to the tasks, I will be very
glad to adopt it. Or should we add setWritable() and getWritable() to the
Configuration?
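A rough sketch of what setWritable()/getWritable() helpers might do, assuming
the serialized bytes are stored Base64-encoded as an ordinary string property.
Everything here is hypothetical: the conf is a plain Map, the Writable
interface is a local stand-in that mirrors Hadoop's write/readFields methods,
and the key name and java.util.Base64 encoding are illustrative choices.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

class WritableConfSketch {
    /** Local stand-in mirroring the two methods of Hadoop's Writable. */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Tiny example Writable holding one int. */
    static class IntValue implements Writable {
        int value;
        IntValue() {}
        IntValue(int value) { this.value = value; }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    /** Serializes the Writable and stores it as a Base64 string under the key. */
    static void setWritable(Map<String, String> conf, String key, Writable w) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        conf.put(key, Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    /** Decodes the stored string and fills the caller-supplied instance. */
    static <T extends Writable> T getWritable(Map<String, String> conf, String key, T target) {
        byte[] raw = Base64.getDecoder().decode(conf.get(key));
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            target.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setWritable(conf, "filter.range.min", new IntValue(-42)); // hypothetical key
        IntValue min = getWritable(conf, "filter.range.min", new IntValue());
        System.out.println(min.value); // -42
    }
}
```

On the task side the filter would then compare real values instead of their
string representations, at the cost of an opaque property in the job
configuration.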
> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
> Key: HADOOP-449
> URL: https://issues.apache.org/jira/browse/HADOOP-449
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Owen O'Malley
> Assignee: Enis Soztutar
> Fix For: 0.17.0
>
> Attachments: filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFilter that was introduced in
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I
> propose:
> interface WritableFilter {
> boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
> ...
> }
> FilterInputFormat would look in the JobConf for:
> mapred.input.filter.source = the underlying input format
> mapred.input.filter.filters = a list of class names that implement
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an
> internal RecordReader rather than the SequenceFile. This will require adding
> a next(key) and getCurrentValue(value) to the RecordReader interface, but
> that will be addressed in a different issue.