[ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582845#action_12582845 ]

Chris Douglas commented on HADOOP-449:
--------------------------------------

bq. will change the postfix expressions and develop a more intuitive way...

+1 I like this syntax. Since you're passing serialized objects with your 
filters, you might want to test larger expressions to make sure length limits 
in the Configuration aren't a problem, but hitting them seems unlikely. I don't 
know if we even have limits in that area, but again: it'd be worth testing. On 
that note, for the FunctionFilters you've defined, it might be a good idea to 
permit them to take an arbitrary number of arguments >=2, as in:

{noformat}
Filter f1 = new RangeFilter(2, 5);
Filter f2 = new RangeFilter(10, 20);
Filter f3 = new RangeFilter(30, 40);
Filter orFilter = new ORFilter(f1, f2, f3);
{noformat}

With your new syntax, this would be easy to implement, would present more 
opportunities for optimization within your FilterEngine, and would be very 
convenient for users.
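For what it's worth, a varargs constructor makes the n-ary case nearly free to 
support. A minimal sketch, assuming a boolean accept(key)-style Filter 
interface (the actual interface in the patch may differ):

```java
// Hypothetical Filter interface, along the lines discussed in this issue.
interface Filter {
  boolean accept(Object key);
}

// Sketch of an ORFilter taking an arbitrary number of child filters (>= 2).
class ORFilter implements Filter {
  private final Filter[] children;

  ORFilter(Filter... children) {
    if (children.length < 2) {
      throw new IllegalArgumentException("ORFilter needs at least two children");
    }
    this.children = children;
  }

  @Override
  public boolean accept(Object key) {
    for (Filter f : children) {
      if (f.accept(key)) {
        return true; // short-circuit: remaining children never see the key
      }
    }
    return false;
  }
}
```

The short-circuit in the loop is exactly the kind of optimization opportunity 
an n-ary operator exposes: the engine can order children by selectivity or 
cost.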

bq. Having one eval method is a cleaner interface to core developers who could 
understand how the postfix expression is evaluated...

Isn't all of this hidden by the FilterEngine? I'm not sure I understand what 
you're asserting in this paragraph... I thought we were discussing whether or 
not it made sense to collapse Filters and FunctionFilters into a single Filter 
interface that manipulates the key/stack. By construction, you know that your 
FunctionFilters have either Filters or FunctionFilters as children. Once you 
reconstruct the tree, it's not clear to me why you'd even need a stack. The key 
gets passed through your tree to the child Filters, which return results to the 
parent, which may or may not pass the key to its other children depending on 
the return value. It might make sense to have a FunctionFilter base type from 
which your operators descend, since they share common functionality, but the 
additional interface seems unnecessary. Have I misunderstood you, or am I 
responding to your new syntax instead of the original, postfix, stack-based 
implementation?
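To make the no-stack point concrete: once the tree is reconstructed, 
evaluation is just recursion over children, with short-circuiting deciding 
which children ever see the key. A sketch under the same assumptions 
(RangeFilter and ANDFilter here are illustrative names, not the patch's 
actual classes):

```java
// Illustrative tree evaluation: no explicit stack, the call stack does the work.
interface Filter {
  boolean accept(int key);
}

// Leaf filter: accepts keys in the closed range [lo, hi].
class RangeFilter implements Filter {
  private final int lo, hi;
  RangeFilter(int lo, int hi) { this.lo = lo; this.hi = hi; }
  public boolean accept(int key) { return key >= lo && key <= hi; }
}

// Composite: passes the key down to each child in turn and short-circuits,
// so after a rejecting child the later children never see the key at all.
class ANDFilter implements Filter {
  private final Filter[] children;
  ANDFilter(Filter... children) { this.children = children; }
  public boolean accept(int key) {
    for (Filter f : children) {
      if (!f.accept(key)) {
        return false; // stop at the first rejection
      }
    }
    return true;
  }
}
```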

bq. The postfix additions is irrelevant to whether filtering should be a 
library or not. The postfix expressions are a way to specify the filtering 
expression to use, that part of the API will not be changed if we had sticked 
with FilterInputFormat.

Sorry, I was unclear. You're right, the postfix syntax is orthogonal to this 
discussion since that functionality wasn't present in the original patch. I was 
only pointing out that those who could benefit from Filters aren't going to be 
turned away because they need to use a different InputFormat, i.e. using the 
library poses a more familiar and less difficult problem to users than the 
syntax and implications of Filters.

bq. [library vs core in general]

I don't dispute that your integration of Filters into Tasks has negligible 
cost, or that it doesn't prohibit their use elsewhere and in other frameworks. 
That said, the semantics of Filters match those of InputFormat precisely. At 
its point of integration, filtering does exactly what an InputFormat would 
effect (with one caveat concerning map counters). Keeping it in an InputFormat 
also avoids any confusion about where the filtering occurs, particularly when 
other decorator InputFormats are applied. Though I'm sympathetic to making 
filtering part of every job, setting the InputFormat seems like a modest 
burden that also happens to fit the existing semantics in an intuitive and 
efficient way.

> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filtering_v2.patch, filtering_v3.patch, 
> filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in 
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I 
> propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement 
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an 
> internal RecordReader rather than the SequenceFile. This will require adding 
> a next(key) and getCurrentValue(value) to the RecordReader interface, but 
> that will be addressed in a different issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
