[jira] Commented: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat

Chris Douglas (JIRA) Mon, 11 Feb 2008 11:00:35 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567750#action_12567750 ]


Chris Douglas commented on HADOOP-449:
--------------------------------------

bq. I did not think about the join framework. Having a look at it, I guess we 
can still stick with the current framework.

I think your example would work, but I was considering filters at arbitrary 
positions in the join. I was thinking of adding a new node to the parser that 
accepts a Filter and an argument (the range, the regexp, etc) and sets the 
filter expression prior to the instantiation of the RecordReader (as it does 
for mapred.input.dir). Both should work.

bq. I think current implementation is OK, since we are updating and digesting 
the MessageDigest in only the MD5Hashcode function which is already 
synchronized.

The MD5Hashcode function is synchronized on the instance, but it's protecting a 
static. Unless there's only one instance of the MD5PercentFilter, synchronizing 
on the method is insufficient, no?
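
To illustrate the concern, here is a minimal sketch, not the code from the patch. The class name SharedDigestFilter, the LOCK field, and the way the digest is folded into a long are all assumptions made for the example. The point is only that a static MessageDigest needs a lock shared by every instance; a synchronized instance method locks on "this", so two filter instances could still interleave update() and digest() calls.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical stand-in for a filter that hashes keys with a shared MD5 digest. */
class SharedDigestFilter {
    // One digest shared by every instance, as in the case under discussion.
    private static final MessageDigest DIGESTER = newDigest();
    // Class-wide lock (assumed name). An instance-level synchronized method
    // would lock on "this" and would not guard DIGESTER across instances.
    private static final Object LOCK = new Object();

    private static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hashes the key under the class-wide lock and folds the first 8 digest bytes into a long. */
    long md5Hashcode(String key) {
        byte[] digest;
        synchronized (LOCK) {
            DIGESTER.update(key.getBytes(StandardCharsets.UTF_8));
            digest = DIGESTER.digest(); // digest() also resets the shared state
        }
        long h = 0;
        for (int i = 0; i < 8; i++) {
            h |= (digest[i] & 0xffL) << (8 * (7 - i));
        }
        return h;
    }
}
```

Giving each instance its own MessageDigest, or keeping one per thread, would avoid the shared lock entirely; which of these fits the patch best is a judgment call.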

bq. I think we better be pragmatic about this one. Let's not spend some 
nontrivial amount of effort on this. We can fix it if it is exploited in some 
way.

*nod* Again, I think it'll be fine for the majority of cases, but I thought I'd 
mention it.

bq. People are expected to read the javadocs before using the classes.

Well, fair enough. Really, it only supports Text, and this seems like a 
convenient way to annotate the class since it's not difficult to effect the 
translation. Further, toString isn't usually considered in the 
Comparable/equals/hashCode family of equality, so it seems risky.
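
A small sketch of why relying on toString seems risky, assuming the filter matches a regexp against key.toString(). The names ToStringRisk, IdKey, and accept are hypothetical and not taken from the patch. A key class can define equals and hashCode correctly and still leave toString at the Object default, in which case a pattern written against the key's value never matches.

```java
import java.util.regex.Pattern;

final class ToStringRisk {
    /** Hypothetical key that defines equality but leaves toString at the Object default. */
    static final class IdKey {
        final int id;

        IdKey(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof IdKey && ((IdKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(id);
        }
    }

    /** Matches on the key's string form, as a regexp-style filter might. */
    static boolean accept(Pattern pattern, Object key) {
        return pattern.matcher(key.toString()).matches();
    }
}
```

Here two IdKey(42) objects are equal, yet accept(Pattern.compile("\\d+"), new IdKey(42)) is false because the default toString yields a class name and hash rather than the id, while the same pattern accepts the String "42".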

> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFilter that was introduced in 
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I 
> propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement 
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an 
> internal RecordReader rather than the SequenceFile. This will require adding 
> a next(key) and getCurrentValue(value) to the RecordReader interface, but 
> that will be addressed in a different issue.
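
For readers unfamiliar with the proposal, the following is a rough sketch of how a WritableFilter might be applied. It is not the attached patch. Writable here is a local stand-in for org.apache.hadoop.io.Writable so the example compiles without Hadoop; TextKey, PrefixFilter, and FilterSketch are hypothetical names; and the rule that a key is kept only when every configured filter accepts it is an assumption, since the description does not say how multiple filters combine.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.hadoop.io.Writable so the sketch compiles without Hadoop.
interface Writable {}

// The interface proposed in the issue description.
interface WritableFilter {
    boolean accept(Writable item);
}

// Hypothetical key type used only for illustration.
final class TextKey implements Writable {
    final String value;

    TextKey(String value) {
        this.value = value;
    }
}

// Hypothetical filter: accepts TextKey items whose value starts with a prefix.
final class PrefixFilter implements WritableFilter {
    private final String prefix;

    PrefixFilter(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public boolean accept(Writable item) {
        return item instanceof TextKey && ((TextKey) item).value.startsWith(prefix);
    }
}

final class FilterSketch {
    /** Keeps only the keys accepted by every filter (assumed semantics). */
    static List<Writable> applyAll(List<? extends Writable> keys, List<WritableFilter> filters) {
        List<Writable> kept = new ArrayList<>();
        for (Writable key : keys) {
            boolean accepted = true;
            for (WritableFilter filter : filters) {
                if (!filter.accept(key)) {
                    accepted = false;
                    break;
                }
            }
            if (accepted) {
                kept.add(key);
            }
        }
        return kept;
    }
}
```

In the proposed FilterInputFormat the same check would presumably sit inside the wrapping RecordReader, which would skip rejected keys instead of building a list.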

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
