[ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated HADOOP-449:
---------------------------------
Attachment: filtering_v2.patch
After spending some time thinking about this patch, I have redesigned the
API. The changes are:
* Refactored WritableFilter into Filter, so that a Filter can be applied to
non-Writables (in line with the serialization framework).
* Added a Stringifier interface and a default implementation that uses the
Hadoop serialization framework. Ordinary objects can now be kept in the
configuration. I acknowledge the performance loss of String.equals()
comparison, but we had to either pass the actual objects in the configuration
or not use filtering at all.
* Added FilterEngine to evaluate postfix filter expressions
* Added OR, AND, NOT Filters
* Fixed synchronization issue in MessageDigest
* Moved filtering into the core framework instead of a library.
* Changed the API so that JobConf is now used to add filters. This API is
better since it hides nearly all the details from the application code.
Applications just configure the filter by calling JobConf#addFilter().
* Added a counter for filtered-out records
* Added filtering section to the mapred tutorial.
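
To make the postfix evaluation and the OR/AND/NOT composition concrete, here
is a minimal standalone sketch. It is not code from the patch: the Filter
interface below is a hypothetical stand-in, and the compile() method, the
token strings "AND"/"OR"/"NOT", and the class name are assumptions made only
for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class PostfixFilterSketch {

    /** Hypothetical stand-in for a generic (non-Writable-specific) filter. */
    interface Filter<T> {
        boolean accept(T item);
    }

    static <T> Filter<T> and(Filter<T> a, Filter<T> b) {
        return item -> a.accept(item) && b.accept(item);
    }

    static <T> Filter<T> or(Filter<T> a, Filter<T> b) {
        return item -> a.accept(item) || b.accept(item);
    }

    static <T> Filter<T> not(Filter<T> a) {
        return item -> !a.accept(item);
    }

    /**
     * Builds one composite filter from a postfix token list. Each token is
     * either a Filter instance or one of the strings "AND", "OR", "NOT".
     */
    static <T> Filter<T> compile(List<Object> postfix) {
        Deque<Filter<T>> stack = new ArrayDeque<>();
        for (Object token : postfix) {
            if (token instanceof Filter) {
                @SuppressWarnings("unchecked")
                Filter<T> operand = (Filter<T>) token;
                stack.push(operand);
            } else if ("NOT".equals(token)) {
                stack.push(not(pop(stack)));
            } else if ("AND".equals(token) || "OR".equals(token)) {
                // The right operand was pushed last, so it is popped first.
                Filter<T> right = pop(stack);
                Filter<T> left = pop(stack);
                stack.push("AND".equals(token) ? and(left, right) : or(left, right));
            } else {
                throw new IllegalArgumentException("Unknown token: " + token);
            }
        }
        if (stack.size() != 1) {
            throw new IllegalArgumentException("Malformed postfix expression");
        }
        return stack.pop();
    }

    private static <T> Filter<T> pop(Deque<Filter<T>> stack) {
        if (stack.isEmpty()) {
            throw new IllegalArgumentException("Operator is missing an operand");
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        Filter<String> startsWithA = s -> s.startsWith("a");
        Filter<String> longerThan3 = s -> s.length() > 3;
        // Postfix form of NOT (startsWithA AND longerThan3).
        Filter<String> f = compile(
                Arrays.<Object>asList(startsWithA, longerThan3, "AND", "NOT"));
        System.out.println(f.accept("apple")); // false
        System.out.println(f.accept("ab"));    // true
    }
}
```

Compiling the expression once into a single composite filter keeps the
per-record cost down to plain accept() calls; a real engine could equally
well re-evaluate the postfix list for every record.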
> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
> Key: HADOOP-449
> URL: https://issues.apache.org/jira/browse/HADOOP-449
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Owen O'Malley
> Assignee: Enis Soztutar
> Fix For: 0.17.0
>
> Attachments: filtering_v2.patch, filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFilter that was introduced in
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I
> propose:
> interface WritableFilter {
>   boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>   mapred.input.filter.source = the underlying input format
>   mapred.input.filter.filters = a list of class names that implement
>     WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an
> internal RecordReader rather than the SequenceFile. This will require adding
> a next(key) and getCurrentValue(value) to the RecordReader interface, but
> that will be addressed in a different issue.
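
The wrapping idea in the description (an inner reader whose rejected records
are skipped) can be sketched without any Hadoop types. The example below is
only an illustration under stated assumptions: it wraps a plain
java.util.Iterator rather than a RecordReader, the Filter interface is a
hypothetical stand-in, and getFilteredOut() is an invented name that mimics
a counter for filtered-out records.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class FilteringReaderSketch {

    /** Hypothetical stand-in for the filter interface. */
    interface Filter<T> {
        boolean accept(T item);
    }

    /** Wraps any iterator, skips rejected records, and counts the skips. */
    static final class FilteringIterator<T> implements Iterator<T> {
        private final Iterator<T> source;
        private final Filter<T> filter;
        private T lookahead;
        private boolean hasLookahead;
        private long filteredOut;

        FilteringIterator(Iterator<T> source, Filter<T> filter) {
            this.source = source;
            this.filter = filter;
            advance();
        }

        /** Reads from the source until a record is accepted or input ends. */
        private void advance() {
            while (source.hasNext()) {
                T candidate = source.next();
                if (filter.accept(candidate)) {
                    lookahead = candidate;
                    hasLookahead = true;
                    return;
                }
                filteredOut++;
            }
            lookahead = null;
            hasLookahead = false;
        }

        @Override
        public boolean hasNext() {
            return hasLookahead;
        }

        @Override
        public T next() {
            if (!hasLookahead) {
                throw new NoSuchElementException();
            }
            T result = lookahead;
            advance();
            return result;
        }

        /** Number of records rejected so far. */
        long getFilteredOut() {
            return filteredOut;
        }
    }

    public static void main(String[] args) {
        Iterator<Integer> source = Arrays.asList(1, 2, 3, 4, 5, 6).iterator();
        FilteringIterator<Integer> evens =
                new FilteringIterator<>(source, n -> n % 2 == 0);
        while (evens.hasNext()) {
            System.out.println(evens.next()); // prints 2, 4, 6 on separate lines
        }
        System.out.println("filtered out: " + evens.getFilteredOut()); // 3
    }
}
```

Because the wrapper reads one record ahead, the skip count reflects every
record consumed from the source so far, not just those before the last
record returned.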