[
https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580149#action_12580149
]
Chris Douglas commented on HADOOP-449:
--------------------------------------
Wow, looks like quite a rewrite! A few points:
* Would it be possible to split the Stringify functionality out into a separate
JIRA? It is independently useful functionality and would be easier to review if
it were extracted from an already formidable patch. I like this idea.
** DefaultStringifier could reuse more of the objects it creates. The
serializer will hold a reference to the ByteArray(Input/Output)Buffers you've
defined, so re-opening and closing them every time is unnecessary if you use a
resettable stream like o.a.h.io.Data(Input/Output)Buffer.
** The static convenience methods (load, store, storeArray) that create new
Stringifiers in DefaultStringifier are probably unnecessary. Adding a
stringified object to a config doesn't need that much support. Further, it's
not clear to me how one would register a Stringifier that isn't a
DefaultStringifier. Would it make sense to follow the pattern in the
Serialization framework from HADOOP-1986 (or the WritableComparator static
initialization) and register custom Stringifiers for classes?
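The per-class registration pattern suggested above could look something like the following self-contained sketch. The Stringifier interface and the define/get names here are illustrative stand-ins (not Hadoop's actual API), written in the spirit of WritableComparator's static registration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a per-class Stringifier registry, analogous to
// WritableComparator.define(). All names here are illustrative.
public class StringifierRegistry {
    public interface Stringifier<T> {
        String toString(T obj);
        T fromString(String s);
    }

    private static final Map<Class<?>, Stringifier<?>> REGISTRY = new HashMap<>();

    /** Register a custom Stringifier for a class. */
    public static synchronized <T> void define(Class<T> c, Stringifier<T> s) {
        REGISTRY.put(c, s);
    }

    /** Look up the Stringifier registered for a class, if any. */
    @SuppressWarnings("unchecked")
    public static synchronized <T> Stringifier<T> get(Class<T> c) {
        return (Stringifier<T>) REGISTRY.get(c);
    }

    public static void main(String[] args) {
        define(Integer.class, new Stringifier<Integer>() {
            public String toString(Integer i) { return Integer.toString(i); }
            public Integer fromString(String s) { return Integer.parseInt(s); }
        });
        Stringifier<Integer> s = get(Integer.class);
        System.out.println(s.fromString(s.toString(42))); // prints 42
    }
}
```

A DefaultStringifier could then be the fallback when no custom Stringifier has been defined for a class.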
* Your previous version, where this was library code rather than framework
code, is preferred. The FilterRecordReader should be part of a
FilterInputFormat, which is, as you illustrated with your earlier patch, not an
undue burden on a user. Weaving this into Task and MapTask doesn't seem to add
any additional functionality or syntactic elegance. You could add counters to
the library code without changing core functionality. Being part of Task also
prevents it from being integrated into other libraries (like the aforementioned
join framework).
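Keeping the filtering in library code amounts to a wrapper that skips rejected records, roughly like the sketch below. The RecordReader machinery is simplified to a plain Iterator here, and the Filter interface is a stand-in, not an actual o.a.h.mapred type:

```java
import java.util.Iterator;

// Hypothetical sketch of filtering as library code: a wrapper over any
// underlying reader that skips records the filter rejects.
public class FilterReaderSketch {
    public interface Filter<K> { boolean accept(K key); }

    public static class FilterIterator<K> implements Iterator<K> {
        private final Iterator<K> raw;
        private final Filter<K> filter;
        private K next;

        public FilterIterator(Iterator<K> raw, Filter<K> filter) {
            this.raw = raw;
            this.filter = filter;
            advance();
        }

        // Churn through the underlying reader until an accepted record is found.
        private void advance() {
            next = null;
            while (raw.hasNext()) {
                K candidate = raw.next();
                if (filter.accept(candidate)) { next = candidate; break; }
            }
        }

        public boolean hasNext() { return next != null; }
        public K next() { K k = next; advance(); return k; }
    }

    public static void main(String[] args) {
        java.util.List<Integer> in = java.util.Arrays.asList(1, 2, 3, 4, 5, 6);
        FilterIterator<Integer> it = new FilterIterator<>(in.iterator(), k -> k % 2 == 0);
        while (it.hasNext()) System.out.println(it.next()); // prints 2, 4, 6
    }
}
```

A real FilterInputFormat would wrap the underlying InputFormat's RecordReader the same way, and any library (including a join framework) could reuse it.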
* There are no progress updates as you churn through records in your filtering
engine. You might want to keep a reporter around to prevent timeouts.
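Keeping a reporter around could look roughly like this sketch. The Reporter interface here is a one-method stand-in for o.a.h.mapred.Reporter (only progress() is assumed), and the interval and stand-in filter are arbitrary:

```java
// Hypothetical sketch: ping the reporter while skipping records so the task
// is not killed for inactivity during long runs of rejected records.
public class ProgressWhileFiltering {
    public interface Reporter { void progress(); }

    static final int PROGRESS_INTERVAL = 1000; // report every 1000 records examined

    /** Count accepted records, reporting progress periodically even when
     *  every record in a long run is rejected. Returns the number accepted. */
    public static int filterAll(int[] keys, Reporter reporter) {
        int accepted = 0, examined = 0;
        for (int key : keys) {
            if (++examined % PROGRESS_INTERVAL == 0) {
                reporter.progress();          // prevent task timeout
            }
            if (key % 2 == 0) accepted++;     // stand-in filter: keep even keys
        }
        return accepted;
    }

    public static void main(String[] args) {
        int[] keys = new int[5000];
        for (int i = 0; i < keys.length; i++) keys[i] = i;
        final int[] pings = {0};
        int accepted = filterAll(keys, () -> pings[0]++);
        System.out.println(accepted + " accepted, " + pings[0] + " progress pings");
    }
}
```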
* It would be much better if one could employ a (Raw)Comparator instead of
using compareTo in RangeFilter and equals in ItemFilter. Using
JobConf::getOutputKeyComparator would be a good start, though one could imagine
uses where one would want different comparators for different filters.
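A comparator-driven RangeFilter could be as simple as the sketch below. The class name is illustrative; in Hadoop the comparator could default to JobConf.getOutputKeyComparator(), with each filter free to take a different one:

```java
import java.util.Comparator;

// Hypothetical sketch: a range filter parameterized by a Comparator instead of
// relying on Comparable.compareTo.
public class ComparatorRangeFilter<K> {
    private final Comparator<K> cmp;
    private final K lo, hi;   // inclusive bounds

    public ComparatorRangeFilter(Comparator<K> cmp, K lo, K hi) {
        this.cmp = cmp; this.lo = lo; this.hi = hi;
    }

    public boolean accept(K key) {
        return cmp.compare(lo, key) <= 0 && cmp.compare(key, hi) <= 0;
    }

    public static void main(String[] args) {
        // A case-insensitive range over strings: an ordering plain compareTo
        // could not express.
        ComparatorRangeFilter<String> f =
            new ComparatorRangeFilter<>(String.CASE_INSENSITIVE_ORDER, "apple", "mango");
        System.out.println(f.accept("BANANA")); // prints true
        System.out.println(f.accept("zebra"));  // prints false
    }
}
```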
* It is unnecessary to distinguish between Filters and FunctionFilters. A
single eval method accepting the key and stack is sufficient (and permits
functions like AND and OR to short-circuit, somewhat). Forcing the application
of every filter should probably be avoided when possible (see next bullet).
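The single-eval-method idea could be sketched as follows, with leaves and operators sharing one interface that takes the key and the evaluation stack. All names here are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: one Filter interface for both leaves and operators,
// evaluated over a postfix sequence with an explicit stack.
public class PostfixFilterEval {
    public interface Filter<K> { void eval(K key, Deque<Boolean> stack); }

    /** Leaf filter: pushes its own verdict for the key. */
    public static <K> Filter<K> leaf(java.util.function.Predicate<K> p) {
        return (key, stack) -> stack.push(p.test(key));
    }

    /** AND operator: pops two verdicts, pushes their conjunction. */
    public static <K> Filter<K> and() {
        return (key, stack) -> stack.push(stack.pop() & stack.pop());
    }

    /** OR operator: pops two verdicts, pushes their disjunction. */
    public static <K> Filter<K> or() {
        return (key, stack) -> stack.push(stack.pop() | stack.pop());
    }

    /** Evaluate a postfix filter expression against one key. */
    public static <K> boolean accept(List<Filter<K>> postfix, K key) {
        Deque<Boolean> stack = new ArrayDeque<>();
        for (Filter<K> f : postfix) f.eval(key, stack);
        return stack.pop();
    }

    public static void main(String[] args) {
        // (key > 10) AND (key is even), in postfix order
        List<Filter<Integer>> expr = java.util.Arrays.asList(
            leaf(k -> k > 10), leaf(k -> k % 2 == 0), and());
        System.out.println(accept(expr, 12)); // prints true
        System.out.println(accept(expr, 11)); // prints false
    }
}
```

The short-circuiting is only "somewhat" here because postfix order evaluates operands before operators; a tree representation (next bullet) short-circuits fully.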
* FilterEngine uses a static method and int to track the insertion order of
filters and their associated properties, which is probably not the correct
approach. Consider a case where a client submits two jobs, both with filters.
Given that you require the user to add their expression in postfix notation, it
doesn't seem unduly onerous to require them to track the position of their
filters and avoid keeping a count. That said, assigning sequential identifiers
to your filter parameters is probably not an ideal approach in itself. You
might consider writing a parser or creating a Stringify-able object hierarchy
(as in Swing, sort of) for your expressions. Something that looks more like an
expression parse tree would probably effect a more efficient data structure.
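A Stringify-able expression-tree hierarchy might look like this sketch: each node evaluates its subtree (with full short-circuiting) and can serialize itself to a single config string, so no sequential identifiers are needed. The class names and string format are illustrative:

```java
// Hypothetical sketch: an expression tree of filters that serializes to one
// round-trippable string instead of numbered per-filter config keys.
public class FilterExpr {
    public interface Node<K> {
        boolean eval(K key);
        String stringify();   // representation suitable for a single config value
    }

    public static <K> Node<K> leaf(String name, java.util.function.Predicate<K> p) {
        return new Node<K>() {
            public boolean eval(K key) { return p.test(key); }
            public String stringify() { return name; }
        };
    }

    public static <K> Node<K> and(Node<K> l, Node<K> r) {
        return new Node<K>() {
            public boolean eval(K key) { return l.eval(key) && r.eval(key); } // short-circuits
            public String stringify() { return "(AND " + l.stringify() + " " + r.stringify() + ")"; }
        };
    }

    public static <K> Node<K> or(Node<K> l, Node<K> r) {
        return new Node<K>() {
            public boolean eval(K key) { return l.eval(key) || r.eval(key); } // short-circuits
            public String stringify() { return "(OR " + l.stringify() + " " + r.stringify() + ")"; }
        };
    }

    public static void main(String[] args) {
        Node<Integer> expr = and(leaf("gt10", k -> k > 10), leaf("even", k -> k % 2 == 0));
        System.out.println(expr.stringify()); // prints (AND gt10 even)
        System.out.println(expr.eval(12));    // prints true
    }
}
```

A small parser for the stringified form would complete the round trip from config back to tree.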
* Why is the MessageDigest static in MD5PercentFilter? Unless it is expensive
to create, the synchronization will probably cost more than making this an
instance variable.
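The instance-variable version could be sketched like this. The class name and bucketing scheme are illustrative (first 16 bits of the digest against a fraction of the space), not the patch's actual logic:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: hold the MessageDigest as an instance variable so no
// synchronization is needed. MessageDigest instances are cheap to create but
// not thread-safe, so each filter instance keeps its own.
public class Md5PercentFilterSketch {
    private final MessageDigest md5;          // per-instance, no locking needed
    private final double fraction;            // fraction of keys to accept, in [0, 1]

    public Md5PercentFilterSketch(double fraction) {
        try {
            this.md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 unavailable", e); // MD5 ships with the JDK
        }
        this.fraction = fraction;
    }

    /** Accept a key if its MD5 hash lands in the lowest `fraction` of the space. */
    public boolean accept(byte[] key) {
        md5.reset();
        byte[] digest = md5.digest(key);
        int bucket = ((digest[0] & 0xff) << 8) | (digest[1] & 0xff); // first 16 bits
        return bucket < fraction * 0x10000;
    }

    public static void main(String[] args) {
        Md5PercentFilterSketch all = new Md5PercentFilterSketch(1.0);
        Md5PercentFilterSketch none = new Md5PercentFilterSketch(0.0);
        byte[] key = "some-key".getBytes();
        System.out.println(all.accept(key));  // prints true
        System.out.println(none.accept(key)); // prints false
    }
}
```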
Well done! I look forward to this.
> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
> Key: HADOOP-449
> URL: https://issues.apache.org/jira/browse/HADOOP-449
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Owen O'Malley
> Assignee: Enis Soztutar
> Fix For: 0.17.0
>
> Attachments: filtering_v2.patch, filtering_v3.patch,
> filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I
> propose:
> interface WritableFilter {
>   boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
> mapred.input.filter.source = the underlying input format
> mapred.input.filter.filters = a list of class names that implement
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an
> internal RecordReader rather than the SequenceFile. This will require adding
> a next(key) and getCurrentValue(value) to the RecordReader interface, but
> that will be addressed in a different issue.