[
https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567750#action_12567750
]
Chris Douglas commented on HADOOP-449:
--------------------------------------
bq. I did not think about the join framework. Having a look at it, I guess we
can still stick with the current framework.
I think your example would work, but I was considering filters at arbitrary
positions in the join. I was thinking of adding a new node to the parser that
accepts a Filter and an argument (the range, the regexp, etc) and sets the
filter expression prior to the instantiation of the RecordReader (as it does
for mapred.input.dir). Both should work.
bq. I think current implementation is OK, since we are updating and digesting
the MessageDigest in only the MD5Hashcode function which is already
synchronized.
The MD5Hashcode function is synchronized on the instance, but it's protecting a
static. Unless there's only one instance of the MD5PercentFilter, synchronizing
on the method is insufficient, no?
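To make the concern concrete, here is a minimal sketch. The class and method names are hypothetical stand-ins, not the code from the patch; the only assumption carried over is that a single static MessageDigest is shared by every filter instance.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the filter under discussion; not the patch code.
class SharedDigestSketch {
    // Assumption: one digest object shared by every instance of the filter.
    private static final MessageDigest DIGEST = newMd5();

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // A synchronized instance method locks on 'this'. Two different
    // instances hold two different monitors, so both can be inside this
    // method at once and interleave their update()/digest() calls on DIGEST.
    synchronized long hashInstanceLocked(byte[] data) {
        DIGEST.update(data);
        return firstEightBytes(DIGEST.digest());
    }

    // Locking on the class object gives all instances the same monitor,
    // so access to the shared digest is serialized.
    long hashClassLocked(byte[] data) {
        synchronized (SharedDigestSketch.class) {
            DIGEST.update(data);
            return firstEightBytes(DIGEST.digest());
        }
    }

    // Packs the first eight bytes of the 16-byte MD5 result into a long.
    private static long firstEightBytes(byte[] digest) {
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (digest[i] & 0xff);
        }
        return value;
    }
}
```

Both methods return the same value when called from a single thread; they differ only in which monitor guards the static. Making the digest a per-instance field, or keeping one per thread, would presumably avoid the shared state altogether.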
bq. I think we better be pragmatic about this one. Let's not spend some
nontrivial amount of effort on this. We can fix it if it is exploited in some
way.
*nod* Again, I think it'll be fine for the majority of cases, but I thought I'd
mention it.
bq. People are expected to read the javadocs before using the classes.
Well, fair enough. Really, it only supports Text, and this seems like a
convenient way to annotate the class since it's not difficult to effect the
translation. Further, toString isn't usually considered in the
Comparable/equals/hashCode family of equality, so it seems risky.
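A small illustration of that risk, using a hypothetical key type and a hypothetical accept helper (neither is from the patch): two keys that are equal under equals/hashCode can still produce different strings, so a filter that matches on toString may treat them differently.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical example; the Key type and accept() helper are illustrative only.
class ToStringFilterSketch {
    // A key whose equality ignores case, while toString keeps the original case.
    static final class Key {
        final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Key && ((Key) other).name.equalsIgnoreCase(name);
        }

        @Override
        public int hashCode() {
            return name.toLowerCase(Locale.ROOT).hashCode();
        }

        @Override
        public String toString() {
            return name;
        }
    }

    // Accepts a key when its string form matches the whole pattern.
    static boolean accept(Pattern pattern, Object key) {
        return pattern.matcher(key.toString()).matches();
    }
}
```

With the pattern `[a-z]+`, `new Key("abc")` is accepted and `new Key("ABC")` is rejected, even though the two keys compare equal.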
> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
> Key: HADOOP-449
> URL: https://issues.apache.org/jira/browse/HADOOP-449
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Owen O'Malley
> Assignee: Enis Soztutar
> Fix For: 0.17.0
>
> Attachments: filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFilter that was introduced in
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I
> propose:
>   interface WritableFilter {
>     boolean accept(Writable item);
>   }
>   class FilterInputFormat implements InputFormat {
>     ...
>   }
> FilterInputFormat would look in the JobConf for:
> mapred.input.filter.source = the underlying input format
> mapred.input.filter.filters = a list of class names that implement
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an
> internal RecordReader rather than the SequenceFile. This will require adding
> a next(key) and getCurrentValue(value) to the RecordReader interface, but
> that will be addressed in a different issue.