[
https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580886#action_12580886
]
Enis Soztutar commented on HADOOP-449:
--------------------------------------
bq. Wow, looks like quite a rewrite! A few points:
Indeed it is *smile*
bq. Would it be possible to separate the Stringify functionality into a separate JIRA? It is independently useful functionality and would be easier to review if it were extracted from an already formidable patch.
I like this idea. Done in HADOOP-3048.
bq. Your previous version- where this was library, not framework code- is
preferred. The FilterRecordReader should be part of a FilterInputFormat, which
is- as you illustrated with your earlier patch- not an undue burden on a user.
Weaving this into Task and MapTask doesn't seem to add any additional
functionality or syntactic elegance. You could add counters to the library code
without changing core functionality. Being part of Task also prevents it from
being integrated into other libraries (like the aforementioned join framework).
Implementing this functionality as a library was good, and as you mentioned,
there are lots of benefits to doing so. However, after spending some time on
this I realized that implementing it in the core framework would be even better,
because:
* Theoretically *every* job can/should use the filtering functionality, since
there is no drawback but lots of benefits. This means the InputFormat of
*every* job would have to be FilterInputFormat, hiding the real InputFormat
behind FilterInputFormat#setBaseInputFormat().
* There is a lot of legacy code that could benefit from this, but people will
be reluctant (or lazy) to convert their job's input format to the filter. So
we should aim for maximum usability with minimum code change.
* Although the functionality lives in the core, we only change a few lines
(except FilterRecordReader) in the Task and MapTask classes, effectively
encapsulating the functionality. We may extract FilterRecordReader to its own
class so that it is completely separate. I should note that join can readily
use filtering: the filtering runs before the record is passed to the mapper,
so the joined keys would be filtered as well.
At the risk of repeating myself, I believe that every InputFormat of every job
should be FilterInputFormat, which is why we should integrate this into the
core. However, I am open to any suggestions and discussions. *smile*
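To make the library-vs-core trade-off concrete, the library-style alternative can be sketched without the Hadoop APIs as a reader that wraps a base record source and skips rejected records; in the core variant argued for above, the same skip loop would sit in the task's record-consumption path instead. All names here are illustrative, not from the patch:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: a filtering reader wrapping a base record source.
interface Filter<K> {
    boolean accept(K key);
}

class FilteringReader<K> implements Iterator<K> {
    private final Iterator<K> base;
    private final List<Filter<K>> filters;
    private K nextKey;

    FilteringReader(Iterator<K> base, List<Filter<K>> filters) {
        this.base = base;
        this.filters = filters;
        advance();
    }

    // Pull records from the base reader until one passes every filter.
    private void advance() {
        nextKey = null;
        while (base.hasNext()) {
            K k = base.next();
            boolean ok = true;
            for (Filter<K> f : filters) {
                if (!f.accept(k)) { ok = false; break; }
            }
            if (ok) { nextKey = k; return; }
        }
    }

    public boolean hasNext() { return nextKey != null; }

    public K next() {
        K k = nextKey;
        advance();
        return k;
    }
}
```

Either way the mapper only ever sees accepted records; the question is only where the wrapping happens.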
bq. There are no progress updates as you churn through records in your
filtering engine. You might want to keep a reporter around to prevent timeouts.
I will check that.
bq. It would be much better if one could employ a (Raw)Comparator instead of
using compareTo in RangeFilter and equals in ItemFilter. Using
JobConf::getOutputKeyComparator would be a good start, though one could imagine
uses where one would want different comparators for different filters.
I will check that.
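For what it's worth, a RangeFilter that delegates ordering to an injected Comparator, along the lines Chris suggests, might look roughly like this (a sketch only; the class and constructor here are not from the patch):

```java
import java.util.Comparator;

// Hypothetical RangeFilter using an injected Comparator instead of
// relying on the key type's own compareTo().
class RangeFilter<K> {
    private final Comparator<K> cmp;
    private final K lo;   // inclusive lower bound
    private final K hi;   // exclusive upper bound

    RangeFilter(Comparator<K> cmp, K lo, K hi) {
        this.cmp = cmp;
        this.lo = lo;
        this.hi = hi;
    }

    boolean accept(K key) {
        return cmp.compare(lo, key) <= 0 && cmp.compare(key, hi) < 0;
    }
}
```

JobConf::getOutputKeyComparator could supply a default comparator, with a per-filter override where different filters need different orderings.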
bq. It is unnecessary to distinguish between Filters and FunctionFilters. A
single eval method accepting the key and stack is sufficient (and permits
functions like AND and OR to short-circuit, somewhat). Forcing the application
of every filter should probably be avoided when possible (see next bullet).
Yes, you're right (indeed I did it that way at first). However, it is very
unlikely that a user will implement a FunctionFilter, but quite likely that
she will implement a Filter. Thus adding a stack argument that no plain filter
uses seems confusing and unnecessary. Consider the javadoc for the stack
argument in Filter#accept() reading "@param stack filters should not use this".
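To illustrate the trade-off, the unified single-eval design could be sketched as below: leaf filters ignore the stack entirely, while function filters such as AND pop their already-evaluated operands from it. The names are illustrative, and note that with eager postfix evaluation the operands are computed before AND sees them, so the short-circuiting Chris mentions is only partial:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical unified interface: one eval method for both plain
// filters and function filters.
interface Filter<K> {
    // Leaf filters may ignore the stack entirely.
    boolean eval(K key, Deque<Boolean> stack);
}

class AndFilter<K> implements Filter<K> {
    public boolean eval(K key, Deque<Boolean> stack) {
        boolean right = stack.pop();
        boolean left = stack.pop();
        return left && right;   // operands were already evaluated in postfix order
    }
}

class PostfixEvaluator {
    // Evaluates a postfix sequence of filters against one key.
    static <K> boolean evaluate(Iterable<Filter<K>> postfix, K key) {
        Deque<Boolean> stack = new ArrayDeque<>();
        for (Filter<K> f : postfix) {
            stack.push(f.eval(key, stack));
        }
        return stack.pop();
    }
}
```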
bq. FilterEngine uses a static method and int to track the insertion order of
filters and their associated properties, which is probably not the correct
approach. Consider a case where a client submits two jobs, both with filters.
Given that you require the user to add their expression in postfix notation, it
doesn't seem unduly onerous to require them to track the position of their
filters and avoid keeping a count. That said, assigning sequential identifiers
to your filter parameters is probably not an ideal approach in itself. You
might consider writing a parser or creating a Stringify-able object hierarchy
(as in Swing, sort of) for your expressions. Something that looks more like an
expression parse tree would probably effect a more efficient data structure.
I also think that the best way to do the filtering is to create the object
hierarchy and then pass it through the configuration, but somehow I went with
the postfix-expression approach. I should think of a better way, I guess.
Serializing the filters -> stringify -> store in conf -> pass to Task -> load
from conf -> destringify -> deserialize might work.
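The round trip just described might look roughly like this, with java.util.Properties standing in for JobConf; the property key and class names are assumptions for the sketch, and each filter is required to have a no-arg constructor:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Illustrative round trip: stringify filter classes into a conf
// property, then rebuild the filters on the task side via reflection.
class FilterConfSerde {
    static final String KEY = "mapred.input.filter.filters";  // assumed key name

    static void store(Properties conf, List<Class<?>> filterClasses) {
        StringBuilder sb = new StringBuilder();
        for (Class<?> c : filterClasses) {
            if (sb.length() > 0) sb.append(',');
            sb.append(c.getName());
        }
        conf.setProperty(KEY, sb.toString());
    }

    static List<Object> load(Properties conf) {
        List<Object> filters = new ArrayList<>();
        String v = conf.getProperty(KEY, "");
        if (v.isEmpty()) return filters;
        try {
            for (String name : v.split(",")) {
                // Each filter class needs a public no-arg constructor here.
                filters.add(Class.forName(name).getDeclaredConstructor().newInstance());
            }
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("could not rebuild filters from conf", e);
        }
        return filters;
    }
}
```

Per-filter parameters (range bounds, item sets) would additionally need to be stringified alongside the class names, which is where the Stringify work in HADOOP-3048 would come in.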
bq. Why is the MessageDigest static in MD5PercentFilter? Unless it is expensive
to create, the synchronization will probably cost more than making this an
instance variable.
I just took the implementation from MessageDigest$MD5PercentFilter, which used
a static digest, and I assumed that getInstance() returns a singleton instance
of the object. However, it seems that is not the case. I will fix this.
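For the record, MessageDigest.getInstance() creates a fresh digest on every call rather than returning a shared singleton, so an instance field needs no synchronization as long as each filter object is used by one thread. A minimal sketch (the class name is made up for illustration):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Per-instance digest: no static state, hence no synchronization needed
// when each filter object is confined to a single thread.
class Md5Hasher {
    private final MessageDigest digest;

    Md5Hasher() {
        try {
            // getInstance() returns a new MessageDigest on each call.
            this.digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is required on every Java platform", e);
        }
    }

    byte[] hash(byte[] bytes) {
        digest.reset();
        return digest.digest(bytes);
    }

    // Demonstrates that getInstance() is not a singleton accessor.
    static boolean freshInstances() {
        try {
            return MessageDigest.getInstance("MD5") != MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}
```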
bq. Well done! I look forward to this.
Me too, thanks for the in depth review Chris.
> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
> Key: HADOOP-449
> URL: https://issues.apache.org/jira/browse/HADOOP-449
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Owen O'Malley
> Assignee: Enis Soztutar
> Fix For: 0.17.0
>
> Attachments: filtering_v2.patch, filtering_v3.patch,
> filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I
> propose:
> interface WritableFilter {
>   boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
> mapred.input.filter.source = the underlying input format
> mapred.input.filter.filters = a list of class names that implement
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an
> internal RecordReader rather than the SequenceFile. This will require adding
> a next(key) and getCurrentValue(value) to the RecordReader interface, but
> that will be addressed in a different issue.