[ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581971#action_12581971 ]

Enis Soztutar commented on HADOOP-449:
--------------------------------------

bq. Even if every job can use the filtering functionality, integrating it into 
Task/MapTask limits where it may be applied. If, for example, one were reading 
from multiple sources, different sets of filters could be applied to each 
source. Similarly, a map or a reduce task could use a filtering record reader 
to read a subset of records indirectly. If it's limited to the interfaces you 
provide to MapTask, then this code can't be reused elsewhere. Again, since 
weaving it into core doesn't seem to give you extra functionality- it seems to 
make it less general- and there's zero performance hit, making it a library 
looks laced with win.
Please see below.
bq. Not exactly. If I apply a RangeFilter to each of my record readers, the 
join considers a smaller subset of the records read. Since it's generating the 
cross of all the matching records (i.e. sets A, B, C sharing key k and 
containing values x1, x2 would emit [a1, b1, c1], [a1, b1, c2], [a1, b2, c1], 
... [a2, b2, c2]), my filter would have to reject the cross of all those 
records, rather than each individually. Further, if I only want to filter the 
records from B in the previous example, the filter in my map would need more 
state to ensure I'm not emitting duplicate records to the map (or my map code 
would have to deal with that). One can imagine other cases where, again, 
filtering shouldn't be limited to a single part of the job, or cases where it 
might change the result if filters can only be applied at a certain stage.

What we are discussing differs in just one part:
As a library:
  - FilterRecordReader, same as before
  - classes under io.filter, same as before
  - FilterInputFormat as a thin wrapper (see the sketch after these lists)
  - Configure filters from FilterInputFormat (delegating to FilterEngine)
  - Users may freely extend / use FilterInputFormat, FilterRecordReader, and 
Filters.
As integrated into the core:
  - FilterRecordReader, same as before
  - classes under io.filter, same as before
  - instead of FilterInputFormat, FilterRecordReader wraps 
MapTask.TrackedRecordReader when filtering is enabled.
  - Configure filters from the JobConf (delegating to FilterEngine).
  - Users may freely extend / use FilterRecordReader and Filters.
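
To make the "thin wrapper" point concrete, here is a minimal sketch of the piece 
both variants share: a FilterRecordReader that delegates to any underlying 
RecordReader and skips rejected records. The WritableFilter shape follows the 
proposal quoted at the bottom of this issue; whether a filter sees the key or the 
value is a design choice (SequenceFileInputFilter filters on keys), and none of 
these signatures are necessarily the patch's.
{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.RecordReader;

// Sketch only: follows the WritableFilter proposal quoted below,
// not necessarily the signatures in the patch.
interface WritableFilter {
  boolean accept(Writable item);
}

// Wraps any RecordReader and skips records whose key is rejected by a
// filter. FilterInputFormat (library) and MapTask (core) could both
// reuse this class unchanged.
class FilterRecordReader<K extends Writable, V extends Writable>
    implements RecordReader<K, V> {

  private final RecordReader<K, V> raw;
  private final List<WritableFilter> filters;

  FilterRecordReader(RecordReader<K, V> raw, List<WritableFilter> filters) {
    this.raw = raw;
    this.filters = filters;
  }

  public boolean next(K key, V value) throws IOException {
    while (raw.next(key, value)) {
      if (acceptAll(key)) {
        return true;   // record passed every filter
      }                // otherwise skip it and read the next one
    }
    return false;      // underlying reader is exhausted
  }

  private boolean acceptAll(K key) {
    for (WritableFilter f : filters) {
      if (!f.accept(key)) {
        return false;
      }
    }
    return true;
  }

  public K createKey() { return raw.createKey(); }
  public V createValue() { return raw.createValue(); }
  public long getPos() throws IOException { return raw.getPos(); }
  public float getProgress() throws IOException { return raw.getProgress(); }
  public void close() throws IOException { raw.close(); }
}
{code}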

The bulk of the implementation is in the FilterRecordReader, FilterEngine, and 
Filter implementations, so this code (and the filtering functionality) CAN be 
reused. Moreover, the FilterInputFormat from the previous patch is only a 
wrapper. Assuming we move FilterRecordReader out to a separate class, the join 
framework can easily extend its grammar to accept an InputFormat that filters 
records from its underlying InputFormat (but that is a different issue). In this 
sense, the current patch does not "limit" the applicability of filtering to 
other parts of the system, such as using more than one InputFormat or filtering 
map output records; rather, it provides a filtering framework and enables the 
frequent use case of filtering the job's input records. The other use cases can 
indeed be implemented on top of the patch (in the case of join, it can bypass 
the core filter and introduce filtering at another layer).

bq. I disagree, and I cite your previous patch. Its interface was not only 
easier to understand than the postfix additions, but specifying the 
baseInputFormat was very intuitive. For users seeking to benefit from this, the 
difficulty delta between the library and Task implementations is so slight that 
I doubt it'll actually prevent someone from taking advantage of it.
The postfix additions are irrelevant to whether filtering should be a library or 
not. The postfix expressions are a way to specify the filtering expression to 
use; that part of the API would not have changed if we had stuck with 
FilterInputFormat.

bq. Though Filters will be useful, the semantics of a FunctionFilter aren't so 
mysterious that people won't want to write those, too. Again, the purpose of 
both parameters is easily explained, and people will decide whether they should 
employ them or not. It seems premature to decide that there are only two types 
of filters, anyway. It sounds like we agree that it's a cleaner interface with 
only one signature for the eval; I'm just not sure I see the extensibility 
benefit as clearly.
I have separated the Filter and FunctionFilter interfaces to enhance 
encapsulation, not extensibility. Having one eval method is a cleaner interface 
for core developers, who can understand how the postfix expression is evaluated 
(via Dijkstra's algorithm) in the FilterEngine class, but it is cleaner to hide 
all these details from a user who just wants to develop a Filter. Think about 
3 + 4, or in postfix, 3 4 +. The 3 and 4 are values, and + is a function. 
Clearly, the Filters (which are values) and the FunctionFilters (which are 
functions) are different in that sense. However, 3, 4, and + are all represented 
as symbols so that they can be interpreted and passed to a machine evaluating 
the expression. Similarly, FunctionFilter implements Filter (although it is not 
a filter) so that it can be passed to the FilterEngine. As most users will only 
implement Filters, and the only boolean binary function not implemented is XOR, 
I see no loss of generality in hiding the gory details of FunctionFilters.
Maybe the best way would have been to construct the hierarchy:
{code}
interface ExpressionItem { }
interface Filter extends ExpressionItem { }
interface Function extends ExpressionItem { }

class FilterEngine {
  public void add(ExpressionItem item) { ... }
}
{code}
But I chose not to do this, to simplify things. Anyway, as I said before, now 
that passing complete objects is imaginable thanks to HADOOP-3048, I will change 
the postfix expressions and develop a more intuitive way:
{code}
Filter f1 = new RangeFilter(2, 5);
Filter f2 = new RangeFilter(10, 20);
Filter orFilter = new ORFilter(f1, f2);
Job.addFilter(orFilter);
{code}
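
For illustration, such a composite could be implemented along these lines, 
reusing the hypothetical eval() signature from the sketch above; ORFilter's 
internals are assumed here, not taken from the patch.
{code}
// Hypothetical: accepts a record if either child filter accepts it.
class ORFilter implements Filter {
  private final Filter left;
  private final Filter right;

  ORFilter(Filter left, Filter right) {
    this.left = left;
    this.right = right;
  }

  public boolean eval(Object record) {
    return left.eval(record) || right.eval(record);
  }
}
{code}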

Well, Chris, I see the wisdom in your case for the patch being a library, but I 
have insisted only because I [still] think this way might be better. Other than 
that, since the implicit vote is 2-to-1, I will probably implement the next 
version of the patch as a library including FilterInputFormat, unless someone 
objects.


> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filtering_v2.patch, filtering_v3.patch, 
> filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in 
> HADOOP-412 so that it can be applied to any InputFormat. To do this, I 
> propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement 
> WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an 
> internal RecordReader rather than the SequenceFile. This will require adding 
> a next(key) and getCurrentValue(value) to the RecordReader interface, but 
> that will be addressed in a different issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
