[ 
https://issues.apache.org/jira/browse/CRUNCH-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830127#comment-13830127
 ] 

Gabriel Reid commented on CRUNCH-299:
-------------------------------------

FWIW, option A (giving a ColumnRecordFilter to Parquet Source) is in line with 
what we do in HBase right now, i.e. you can provide a Scan object that does 
whatever kind of filtering you want.

I can imagine that if we had some kind of common FieldValuePredicateFilterFn 
class, we could potentially push it down (or up) to the source, which could 
choose to attempt to use it, something like this:

{code}
    filteredCollection = collection.filter(
        new FieldValuePredicateFilterFn("make", eq("Volkswagen")));
{code}

This could then be pushed up to the source and be interpreted by an HBaseSource as 
"add an equality filter to the scan on values of the 'make' column family", and 
interpreted by the Parquet Source as "create a ColumnRecordFilter on the 'make' 
column". Obviously in the cases of other sources (e.g. text) it would just be 
ignored, and the filter could be executed as usual (which I guess would mean 
using reflection to extract the field value). There are cases where that 
wouldn't work well that I can think of, and probably a lot more that I can't 
think of. Stuff like this is probably a lot easier in cases like Pig, Hive, and 
Cascading where you know that the values passing through the pipeline are all 
tuples.
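
To make the idea a bit more concrete, here is a rough, self-contained sketch 
of the "offer the filter to the source, otherwise fall back to reflection" 
flow described above. It does not use Crunch, HBase or Parquet at all: every 
name in it (PushdownSketch, Car, FieldEquals, Source, tryPushDown, 
PlainSource, FilteringSource) is made up for illustration, and FieldEquals 
only models the equality case of the hypothetical 
FieldValuePredicateFilterFn. A real source would translate the predicate 
into its own native filter instead of applying it in memory.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

class PushdownSketch {

    // Hypothetical record type; any class with a "make" field would do.
    static class Car {
        final String make;
        Car(String make) { this.make = make; }
    }

    // Hypothetical stand-in for FieldValuePredicateFilterFn, equality only.
    static class FieldEquals<T> {
        final String field;
        final Object expected;
        FieldEquals(String field, Object expected) {
            this.field = field;
            this.expected = expected;
        }

        // Fallback path: read the named field reflectively and compare.
        boolean accept(T record) {
            try {
                Field f = record.getClass().getDeclaredField(field);
                f.setAccessible(true);
                return Objects.equals(expected, f.get(record));
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("no field '" + field + "'", e);
            }
        }
    }

    // Hypothetical source contract: sources ignore the filter unless they opt in.
    interface Source<T> {
        List<T> read();
        default boolean tryPushDown(FieldEquals<T> filter) { return false; }
    }

    // Like a text source: cannot use the filter, so tryPushDown stays false.
    static class PlainSource<T> implements Source<T> {
        final List<T> records;
        PlainSource(List<T> records) { this.records = records; }
        public List<T> read() { return records; }
    }

    // Stands in for a Parquet or HBase source that accepts the filter.
    // Here the filter is simply applied while reading.
    static class FilteringSource<T> extends PlainSource<T> {
        FieldEquals<T> pushed;
        FilteringSource(List<T> records) { super(records); }
        @Override public boolean tryPushDown(FieldEquals<T> filter) {
            pushed = filter;
            return true;
        }
        @Override public List<T> read() {
            return pushed == null ? records : keep(records, pushed);
        }
    }

    // Plain in-memory filtering, used by both paths in this sketch.
    static <T> List<T> keep(List<T> records, FieldEquals<T> filter) {
        List<T> out = new ArrayList<>();
        for (T r : records) {
            if (filter.accept(r)) out.add(r);
        }
        return out;
    }

    // Offer the filter to the source first; otherwise filter after reading.
    static <T> List<T> filter(Source<T> source, FieldEquals<T> filter) {
        if (source.tryPushDown(filter)) {
            return source.read();
        }
        return keep(source.read(), filter);
    }

    public static void main(String[] args) {
        List<Car> cars = Arrays.asList(
            new Car("Volkswagen"), new Car("Audi"), new Car("Volkswagen"));
        FieldEquals<Car> vw = new FieldEquals<>("make", "Volkswagen");
        System.out.println(filter(new PlainSource<>(cars), vw).size());     // prints 2
        System.out.println(filter(new FilteringSource<>(cars), vw).size()); // prints 2
    }
}
```

Both calls return the same records; the only difference is where the 
filtering happens. The reflective lookup in accept() is also where the 
"values are not all tuples" problem shows up, since it only works when the 
record class really has a field with that name.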

That being said, I think that this also opens the discussion of how "smart" we 
want Crunch to be, or how much we want to leave optimization things like that 
up to the user. A similar discussion is the idea of letting Crunch 
automatically choose a join strategy based on observations about the data.



> Support predicate pushdown for Parquet sources
> ----------------------------------------------
>
>                 Key: CRUNCH-299
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-299
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Tom White
>            Assignee: Josh Wills
>
> We should be able to push Crunch FilterFn down to a Parquet 
> ColumnRecordFilter. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)
