[
https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580149#action_12580149
]
Chris Douglas commented on HADOOP-449:
--------------------------------------
Wow, looks like quite a rewrite! A few points:
* Would it be possible to split the Stringify functionality out into a separate
JIRA? It is independently useful functionality and would be easier to review if
it were extracted from an already formidable patch. I like this idea.
** DefaultStringifier could reuse more of the objects it creates. The
serializer will hold a reference to the ByteArray(Input/Output)Buffers you've
defined, so re-opening and closing them every time is unnecessary if you use a
resettable stream like o.a.h.io.Data(Input/Output)Buffer.
** The static convenience methods (load, store, storeArray) that create new
Stringifiers in DefaultStringifier are probably unnecessary. Adding a
stringified object to a config doesn't need that much support. Further, it's
not clear to me how one would register a Stringifier that isn't a
DefaultStringifier. Would it make sense to follow the pattern in the
Serialization framework from HADOOP-1986 (or the WritableComparator static
initialization) and register custom Stringifiers for classes?
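The per-class registration pattern suggested above could look something like the following self-contained sketch. The Stringifier interface and the define/get names here are illustrative stand-ins (not Hadoop's actual API), written in the spirit of WritableComparator's static registration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a per-class Stringifier registry, analogous to
// WritableComparator.define(). All names here are illustrative.
public class StringifierRegistry {
    public interface Stringifier<T> {
        String toString(T obj);
        T fromString(String s);
    }

    private static final Map<Class<?>, Stringifier<?>> REGISTRY = new HashMap<>();

    /** Register a custom Stringifier for a class. */
    public static synchronized <T> void define(Class<T> c, Stringifier<T> s) {
        REGISTRY.put(c, s);
    }

    /** Look up the Stringifier registered for a class, if any. */
    @SuppressWarnings("unchecked")
    public static synchronized <T> Stringifier<T> get(Class<T> c) {
        return (Stringifier<T>) REGISTRY.get(c);
    }

    public static void main(String[] args) {
        define(Integer.class, new Stringifier<Integer>() {
            public String toString(Integer i) { return Integer.toString(i); }
            public Integer fromString(String s) { return Integer.parseInt(s); }
        });
        Stringifier<Integer> s = get(Integer.class);
        System.out.println(s.fromString(s.toString(42))); // prints 42
    }
}
```

A DefaultStringifier could then be the fallback when no custom Stringifier has been defined for a class.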
* Your previous version, where this was library code rather than framework
code, is preferred. The FilterRecordReader should be part of a
FilterInputFormat, which is, as you illustrated with your earlier patch, not an
undue burden on a user. Weaving this into Task and MapTask doesn't seem to add
any additional functionality or syntactic elegance. You could add counters to
the library code without changing core functionality. Being part of Task also
prevents it from being integrated into other libraries (like the aforementioned
join framework).
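Keeping the filtering in library code amounts to a wrapper that skips rejected records, roughly like the sketch below. The RecordReader machinery is simplified to a plain Iterator here, and the Filter interface is a stand-in, not an actual o.a.h.mapred type:

```java
import java.util.Iterator;

// Hypothetical sketch of filtering as library code: a wrapper over any
// underlying reader that skips records the filter rejects.
public class FilterReaderSketch {
    public interface Filter<K> { boolean accept(K key); }

    public static class FilterIterator<K> implements Iterator<K> {
        private final Iterator<K> raw;
        private final Filter<K> filter;
        private K next;

        public FilterIterator(Iterator<K> raw, Filter<K> filter) {
            this.raw = raw;
            this.filter = filter;
            advance();
        }

        // Churn through the underlying reader until an accepted record is found.
        private void advance() {
            next = null;
            while (raw.hasNext()) {
                K candidate = raw.next();
                if (filter.accept(candidate)) { next = candidate; break; }
            }
        }

        public boolean hasNext() { return next != null; }
        public K next() { K k = next; advance(); return k; }
    }

    public static void main(String[] args) {
        java.util.List<Integer> in = java.util.Arrays.asList(1, 2, 3, 4, 5, 6);
        FilterIterator<Integer> it = new FilterIterator<>(in.iterator(), k -> k % 2 == 0);
        while (it.hasNext()) System.out.println(it.next()); // prints 2, 4, 6
    }
}
```

A real FilterInputFormat would wrap the underlying InputFormat's RecordReader the same way, and any library (including a join framework) could reuse it.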
* There are no progress updates as you churn through records in your filtering
engine. You might want to keep a reporter around to prevent timeouts.
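Keeping a reporter around could look roughly like this sketch. The Reporter interface here is a one-method stand-in for o.a.h.mapred.Reporter (only progress() is assumed), and the interval and stand-in filter are arbitrary:

```java
// Hypothetical sketch: ping the reporter while skipping records so the task
// is not killed for inactivity during long runs of rejected records.
public class ProgressWhileFiltering {
    public interface Reporter { void progress(); }

    static final int PROGRESS_INTERVAL = 1000; // report every 1000 records examined

    /** Count accepted records, reporting progress periodically even when
     *  every record in a long run is rejected. Returns the number accepted. */
    public static int filterAll(int[] keys, Reporter reporter) {
        int accepted = 0, examined = 0;
        for (int key : keys) {
            if (++examined % PROGRESS_INTERVAL == 0) {
                reporter.progress();          // prevent task timeout
            }
            if (key % 2 == 0) accepted++;     // stand-in filter: keep even keys
        }
        return accepted;
    }

    public static void main(String[] args) {
        int[] keys = new int[5000];
        for (int i = 0; i < keys.length; i++) keys[i] = i;
        final int[] pings = {0};
        int accepted = filterAll(keys, () -> pings[0]++);
        System.out.println(accepted + " accepted, " + pings[0] + " progress pings");
    }
}
```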
* It would be much better if one could employ a (Raw)Comparator instead of
using compareTo in RangeFilter and equals in ItemFilter. Using
JobConf::getOutputKeyComparator would be a good start, though one could imagine
uses where one would want different comparators for different filters.
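A comparator-driven RangeFilter could be as simple as the sketch below. The class name is illustrative; in Hadoop the comparator could default to JobConf.getOutputKeyComparator(), with each filter free to take a different one:

```java
import java.util.Comparator;

// Hypothetical sketch: a range filter parameterized by a Comparator instead of
// relying on Comparable.compareTo.
public class ComparatorRangeFilter<K> {
    private final Comparator<K> cmp;
    private final K lo, hi;   // inclusive bounds

    public ComparatorRangeFilter(Comparator<K> cmp, K lo, K hi) {
        this.cmp = cmp; this.lo = lo; this.hi = hi;
    }

    public boolean accept(K key) {
        return cmp.compare(lo, key) <= 0 && cmp.compare(key, hi) <= 0;
    }

    public static void main(String[] args) {
        // A case-insensitive range over strings: an ordering plain compareTo
        // could not express.
        ComparatorRangeFilter<String> f =
            new ComparatorRangeFilter<>(String.CASE_INSENSITIVE_ORDER, "apple", "mango");
        System.out.println(f.accept("BANANA")); // prints true
        System.out.println(f.accept("zebra"));  // prints false
    }
}
```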
* It is unnecessary to distinguish between Filters and FunctionFilters. A
single eval method accepting the key and stack is sufficient (and permits
functions like AND and OR to short-circuit, somewhat). Forcing the application
of every filter should probably be avoided when possible (see next bullet).
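The single-eval-method idea could be sketched as follows, with leaves and operators sharing one interface that takes the key and the evaluation stack. All names here are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: one Filter interface for both leaves and operators,
// evaluated over a postfix sequence with an explicit stack.
public class PostfixFilterEval {
    public interface Filter<K> { void eval(K key, Deque<Boolean> stack); }

    /** Leaf filter: pushes its own verdict for the key. */
    public static <K> Filter<K> leaf(java.util.function.Predicate<K> p) {
        return (key, stack) -> stack.push(p.test(key));
    }

    /** AND operator: pops two verdicts, pushes their conjunction. */
    public static <K> Filter<K> and() {
        return (key, stack) -> stack.push(stack.pop() & stack.pop());
    }

    /** OR operator: pops two verdicts, pushes their disjunction. */
    public static <K> Filter<K> or() {
        return (key, stack) -> stack.push(stack.pop() | stack.pop());
    }

    /** Evaluate a postfix filter expression against one key. */
    public static <K> boolean accept(List<Filter<K>> postfix, K key) {
        Deque<Boolean> stack = new ArrayDeque<>();
        for (Filter<K> f : postfix) f.eval(key, stack);
        return stack.pop();
    }

    public static void main(String[] args) {
        // (key > 10) AND (key is even), in postfix order
        List<Filter<Integer>> expr = java.util.Arrays.asList(
            leaf(k -> k > 10), leaf(k -> k % 2 == 0), and());
        System.out.println(accept(expr, 12)); // prints true
        System.out.println(accept(expr, 11)); // prints false
    }
}
```

The short-circuiting is only "somewhat" here because postfix order evaluates operands before operators; a tree representation (next bullet) short-circuits fully.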
* FilterEngine uses a static method and int to track the insertion order of
filters and their associated properties, which is probably not the correct
approach. Consider a case where a client submits two jobs, both with filters.
Given that you require the user to add their expression in postfix notation, it
doesn't seem unduly onerous to require them to track the position of their
filters and avoid keeping a count. That said, assigning sequential identifiers
to your filter parameters is probably not an ideal approach in itself. You
might consider writing a parser or creating a Stringify-able object hierarchy
(as in Swing, sort of) for your expressions. Something that looks more like an
expression parse tree would probably effect a more efficient data structure.
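A Stringify-able expression-tree hierarchy might look like this sketch: each node evaluates its subtree (with full short-circuiting) and can serialize itself to a single config string, so no sequential identifiers are needed. The class names and string format are illustrative:

```java
// Hypothetical sketch: an expression tree of filters that serializes to one
// round-trippable string instead of numbered per-filter config keys.
public class FilterExpr {
    public interface Node<K> {
        boolean eval(K key);
        String stringify();   // representation suitable for a single config value
    }

    public static <K> Node<K> leaf(String name, java.util.function.Predicate<K> p) {
        return new Node<K>() {
            public boolean eval(K key) { return p.test(key); }
            public String stringify() { return name; }
        };
    }

    public static <K> Node<K> and(Node<K> l, Node<K> r) {
        return new Node<K>() {
            public boolean eval(K key) { return l.eval(key) && r.eval(key); } // short-circuits
            public String stringify() { return "(AND " + l.stringify() + " " + r.stringify() + ")"; }
        };
    }

    public static <K> Node<K> or(Node<K> l, Node<K> r) {
        return new Node<K>() {
            public boolean eval(K key) { return l.eval(key) || r.eval(key); } // short-circuits
            public String stringify() { return "(OR " + l.stringify() + " " + r.stringify() + ")"; }
        };
    }

    public static void main(String[] args) {
        Node<Integer> expr = and(leaf("gt10", k -> k > 10), leaf("even", k -> k % 2 == 0));
        System.out.println(expr.stringify()); // prints (AND gt10 even)
        System.out.println(expr.eval(12));    // prints true
    }
}
```

A small parser for the stringified form would complete the round trip from config back to tree.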
* Why is the MessageDigest static in MD5PercentFilter? Unless it is expensive
to create, the synchronization will probably cost more than making this an
instance variable.
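The instance-variable version could be sketched like this. The class name and bucketing scheme are illustrative (first 16 bits of the digest against a fraction of the space), not the patch's actual logic:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: hold the MessageDigest as an instance variable so no
// synchronization is needed. MessageDigest instances are cheap to create but
// not thread-safe, so each filter instance keeps its own.
public class Md5PercentFilterSketch {
    private final MessageDigest md5;          // per-instance, no locking needed
    private final double fraction;            // fraction of keys to accept, in [0, 1]

    public Md5PercentFilterSketch(double fraction) {
        try {
            this.md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 unavailable", e); // MD5 ships with the JDK
        }
        this.fraction = fraction;
    }

    /** Accept a key if its MD5 hash lands in the lowest `fraction` of the space. */
    public boolean accept(byte[] key) {
        md5.reset();
        byte[] digest = md5.digest(key);
        int bucket = ((digest[0] & 0xff) << 8) | (digest[1] & 0xff); // first 16 bits
        return bucket < fraction * 0x10000;
    }

    public static void main(String[] args) {
        Md5PercentFilterSketch all = new Md5PercentFilterSketch(1.0);
        Md5PercentFilterSketch none = new Md5PercentFilterSketch(0.0);
        byte[] key = "some-key".getBytes();
        System.out.println(all.accept(key));  // prints true
        System.out.println(none.accept(key)); // prints false
    }
}
```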
Well done! I look forward to this.
> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
> Key: HADOOP-449
> URL: https://issues.apache.org/jira/browse/HADOOP-449
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Owen O'Malley
> Assignee: Enis Soztutar
> Fix For: 0.17.0
>
> Attachments: filtering_v2.patch, filtering_v3.patch,
> filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I
> propose:
> interface WritableFilter {
>   boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
> mapred.input.filter.source = the underlying input format
> mapred.input.filter.filters = a list of class names that implement
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an
> internal RecordReader rather than the SequenceFile. This will require adding
> a next(key) and getCurrentValue(value) to the RecordReader interface, but
> that will be addressed in a different issue.