[
https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580886#action_12580886
]
Enis Soztutar commented on HADOOP-449:
--------------------------------------
bq. Wow, looks like quite a rewrite! A few points:
Indeed it is *smile*
bq. Would it be possible to separate the Stringify functionality into a separate JIRA? It is independently useful functionality and would be easier to review if it were extracted from an already formidable patch.
I like this idea. Done in HADOOP-3048.
bq. Your previous version- where this was library, not framework code- is
preferred. The FilterRecordReader should be part of a FilterInputFormat, which
is- as you illustrated with your earlier patch- not an undue burden on a user.
Weaving this into Task and MapTask doesn't seem to add any additional
functionality or syntactic elegance. You could add counters to the library code
without changing core functionality. Being part of Task also prevents it from
being integrated into other libraries (like the aforementioned join framework).
Implementing this functionality as a library was good, and as you mentioned,
there are lots of benefits to doing so. However, after spending some time on
this I realized that implementing it in the core framework would be even better,
because:
* Theoretically *every* job can/should use the filtering functionality, since
there is no drawback but lots of benefits. This means the InputFormat of
*every* job would have to be FilterInputFormat, hiding the real InputFormat
behind FilterInputFormat#setBaseInputFormat().
* There is a lot of legacy code that could benefit from this, but people will
be reluctant (or lazy) to convert their job's input format to the filter. So
we should aim for maximum usability with minimum code change.
* Although the functionality lives in the core, we only change a few lines
(except FilterRecordReader) in the Task and MapTask classes, effectively
encapsulating the functionality. We may extract FilterRecordReader to its own
class so that it is completely separate. I should note that join can readily
use filtering: the filtering runs before the record is passed to the mapper,
so the joined keys would be filtered as well.
At the risk of repeating myself, I believe that every InputFormat of every job
should be FilterInputFormat, which is why we should integrate this into the
core. However, I am open to any suggestions and discussions. *smile*
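To make the library-vs-core trade-off concrete, the library-style alternative can be sketched without the Hadoop APIs as a reader that wraps a base record source and skips rejected records; in the core variant argued for above, the same skip loop would sit in the task's record-consumption path instead. All names here are illustrative, not from the patch:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: a filtering reader wrapping a base record source.
interface Filter<K> {
    boolean accept(K key);
}

class FilteringReader<K> implements Iterator<K> {
    private final Iterator<K> base;
    private final List<Filter<K>> filters;
    private K nextKey;

    FilteringReader(Iterator<K> base, List<Filter<K>> filters) {
        this.base = base;
        this.filters = filters;
        advance();
    }

    // Pull records from the base reader until one passes every filter.
    private void advance() {
        nextKey = null;
        while (base.hasNext()) {
            K k = base.next();
            boolean ok = true;
            for (Filter<K> f : filters) {
                if (!f.accept(k)) { ok = false; break; }
            }
            if (ok) { nextKey = k; return; }
        }
    }

    public boolean hasNext() { return nextKey != null; }

    public K next() {
        K k = nextKey;
        advance();
        return k;
    }
}
```

Either way the mapper only ever sees accepted records; the question is only where the wrapping happens.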
bq. There are no progress updates as you churn through records in your
filtering engine. You might want to keep a reporter around to prevent timeouts.
I will check that.
bq. It would be much better if one could employ a (Raw)Comparator instead of
using compareTo in RangeFilter and equals in ItemFilter. Using
JobConf::getOutputKeyComparator would be a good start, though one could imagine
uses where one would want different comparators for different filters.
I will check that.
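For what it's worth, a RangeFilter that delegates ordering to an injected Comparator, along the lines Chris suggests, might look roughly like this (a sketch only; the class and constructor here are not from the patch):

```java
import java.util.Comparator;

// Hypothetical RangeFilter using an injected Comparator instead of
// relying on the key type's own compareTo().
class RangeFilter<K> {
    private final Comparator<K> cmp;
    private final K lo;   // inclusive lower bound
    private final K hi;   // exclusive upper bound

    RangeFilter(Comparator<K> cmp, K lo, K hi) {
        this.cmp = cmp;
        this.lo = lo;
        this.hi = hi;
    }

    boolean accept(K key) {
        return cmp.compare(lo, key) <= 0 && cmp.compare(key, hi) < 0;
    }
}
```

JobConf::getOutputKeyComparator could supply a default comparator, with a per-filter override where different filters need different orderings.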
bq. It is unnecessary to distinguish between Filters and FunctionFilters. A
single eval method accepting the key and stack is sufficient (and permits
functions like AND and OR to short-circuit, somewhat). Forcing the application
of every filter should probably be avoided when possible (see next bullet).
Yes, you're right (indeed I did it that way at first). However, it is very
unlikely that a user will implement a FunctionFilter, but quite likely that
she will implement a Filter. Thus adding a stack argument that no plain filter
uses seems confusing and unnecessary. Consider the javadoc for the stack
argument in Filter#accept() reading "@param stack filters should not use this".
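To illustrate the trade-off, the unified single-eval design could be sketched as below: leaf filters ignore the stack entirely, while function filters such as AND pop their already-evaluated operands from it. The names are illustrative, and note that with eager postfix evaluation the operands are computed before AND sees them, so the short-circuiting Chris mentions is only partial:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical unified interface: one eval method for both plain
// filters and function filters.
interface Filter<K> {
    // Leaf filters may ignore the stack entirely.
    boolean eval(K key, Deque<Boolean> stack);
}

class AndFilter<K> implements Filter<K> {
    public boolean eval(K key, Deque<Boolean> stack) {
        boolean right = stack.pop();
        boolean left = stack.pop();
        return left && right;   // operands were already evaluated in postfix order
    }
}

class PostfixEvaluator {
    // Evaluates a postfix sequence of filters against one key.
    static <K> boolean evaluate(Iterable<Filter<K>> postfix, K key) {
        Deque<Boolean> stack = new ArrayDeque<>();
        for (Filter<K> f : postfix) {
            stack.push(f.eval(key, stack));
        }
        return stack.pop();
    }
}
```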
bq. FilterEngine uses a static method and int to track the insertion order of
filters and their associated properties, which is probably not the correct
approach. Consider a case where a client submits two jobs, both with filters.
Given that you require the user to add their expression in postfix notation, it
doesn't seem unduly onerous to require them to track the position of their
filters and avoid keeping a count. That said, assigning sequential identifiers
to your filter parameters is probably not an ideal approach in itself. You
might consider writing a parser or creating a Stringify-able object hierarchy
(as in Swing, sort of) for your expressions. Something that looks more like an
expression parse tree would probably effect a more efficient data structure.
I also think that the best way to do the filtering is to create the object
hierarchy and then pass it through the configuration, but somehow I went with
the postfix-expression approach. I should think of a better way, I guess.
Serializing the filters -> stringify -> store in conf -> pass to Task -> load
from conf -> destringify -> deserialize might work.
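The round trip just described might look roughly like this, with java.util.Properties standing in for JobConf; the property key and class names are assumptions for the sketch, and each filter is required to have a no-arg constructor:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative round trip: stringify filter classes into a conf
// property, then rebuild the filters on the task side via reflection.
class FilterConfSerde {
    static final String KEY = "mapred.input.filter.filters";  // assumed key name

    static void store(Properties conf, List<Class<?>> filterClasses) {
        StringBuilder sb = new StringBuilder();
        for (Class<?> c : filterClasses) {
            if (sb.length() > 0) sb.append(',');
            sb.append(c.getName());
        }
        conf.setProperty(KEY, sb.toString());
    }

    static List<Object> load(Properties conf) {
        List<Object> filters = new ArrayList<>();
        String v = conf.getProperty(KEY, "");
        if (v.isEmpty()) return filters;
        try {
            for (String name : v.split(",")) {
                // Each filter class needs a public no-arg constructor here.
                filters.add(Class.forName(name).getDeclaredConstructor().newInstance());
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("could not rebuild filters from conf", e);
        }
        return filters;
    }
}
```

Per-filter parameters (range bounds, item sets) would additionally need to be stringified alongside the class names, which is where the Stringify work in HADOOP-3048 would come in.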
bq. Why is the MessageDigest static in MD5PercentFilter? Unless it is expensive
to create, the synchronization will probably cost more than making this an
instance variable.
I just took the implementation from MessageDigest$MD5PercentFilter, which used
a static digest, and I assumed that getInstance() returns a singleton instance
of the object. However, it seems that is not the case. I will fix this.
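For the record, MessageDigest.getInstance() creates a fresh digest on every call rather than returning a shared singleton, so an instance field needs no synchronization as long as each filter object is used by one thread. A minimal sketch (the class name is made up for illustration):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Per-instance digest: no static state, hence no synchronization needed
// when each filter object is confined to a single thread.
class Md5Hasher {
    private final MessageDigest digest;

    Md5Hasher() {
        try {
            // getInstance() returns a new MessageDigest on each call.
            this.digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is required on every Java platform", e);
        }
    }

    byte[] hash(byte[] bytes) {
        digest.reset();
        return digest.digest(bytes);
    }

    // Demonstrates that getInstance() is not a singleton accessor.
    static boolean freshInstances() {
        try {
            return MessageDigest.getInstance("MD5") != MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}
```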
bq. Well done! I look forward to this.
Me too, thanks for the in depth review Chris.
> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
> Key: HADOOP-449
> URL: https://issues.apache.org/jira/browse/HADOOP-449
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Owen O'Malley
> Assignee: Enis Soztutar
> Fix For: 0.17.0
>
> Attachments: filtering_v2.patch, filtering_v3.patch,
> filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I
> propose:
> interface WritableFilter {
>   boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
> mapred.input.filter.source = the underlying input format
> mapred.input.filter.filters = a list of class names that implement
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an
> internal RecordReader rather than the SequenceFile. This will require adding
> a next(key) and getCurrentValue(value) to the RecordReader interface, but
> that will be addressed in a different issue.