[
https://issues.apache.org/jira/browse/CRUNCH-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830127#comment-13830127
]
Gabriel Reid commented on CRUNCH-299:
-------------------------------------
FWIW, option A (giving a ColumnRecordFilter to Parquet Source) is in line with
what we do in HBase right now, i.e. you can provide a Scan object that does
whatever kind of filtering you want.
I can imagine that if we had some kind of common FieldValuePredicateFilterFn
class, we could potentially push it down (or up) to the source, which could
choose to attempt to use it, something like this:
{code}
filteredCollection = collection.filter(
    new FieldValuePredicateFilterFn("make", eq("Volkswagen")));
{code}
This could then be pushed up to the source and interpreted by an HBaseSource as
"add an equality filter to the scan on values of the 'make' column family", and
interpreted by the Parquet Source as "create a ColumnRecordFilter on the 'make'
column". Obviously in the cases of other sources (e.g. text) it would just be
ignored, and the filter could be executed as usual (which I guess would mean
using reflection to extract the field value). I can think of cases where that
wouldn't work well, and there are probably a lot more that I can't think of.
Stuff like this is probably a lot easier in frameworks like Pig, Hive, and
Cascading, where you know that the values passing through the pipeline are all
tuples.
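To make the fallback idea above a bit more concrete, here is a rough, self-contained sketch in plain Java (no Crunch, HBase, or Parquet dependencies). Everything in it is hypothetical: FieldValuePredicateFilterFn and eq are the names floated above and do not exist in Crunch, and SketchSource, tryPushDown, Car, and PushdownSketch are made up for illustration. It only shows the shape of "offer the filter to the source, otherwise run it via reflection", not a proposed implementation.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Predicate;

// Hypothetical: a filter that exposes the field it tests, so a source can inspect it.
class FieldValuePredicateFilterFn<T> {
    final String fieldName;
    final Predicate<Object> predicate;

    FieldValuePredicateFilterFn(String fieldName, Predicate<Object> predicate) {
        this.fieldName = fieldName;
        this.predicate = predicate;
    }

    // Hypothetical helper mirroring eq(...) in the snippet above.
    static Predicate<Object> eq(Object expected) {
        return value -> Objects.equals(expected, value);
    }

    // Fallback path: read the field reflectively and apply the predicate.
    // getDeclaredField only sees fields declared directly on the record's class.
    boolean accept(T record) {
        try {
            Field field = record.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            return predicate.test(field.get(record));
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot read field '" + fieldName + "'", e);
        }
    }
}

// Hypothetical source contract: return true if the filter was absorbed into the read.
interface SketchSource<T> {
    List<T> read();

    // Sources that cannot use the filter (e.g. text) simply decline.
    default boolean tryPushDown(FieldValuePredicateFilterFn<T> fn) {
        return false;
    }
}

// Hypothetical record type used only for the demo.
class Car {
    private final String make;
    private final int year;

    Car(String make, int year) {
        this.make = make;
        this.year = year;
    }
}

class PushdownSketch {
    // Offer the filter to the source; run it in the pipeline only if the source declines.
    static <T> List<T> readFiltered(SketchSource<T> source, FieldValuePredicateFilterFn<T> fn) {
        boolean pushedDown = source.tryPushDown(fn);
        List<T> out = new ArrayList<>();
        for (T record : source.read()) {
            if (pushedDown || fn.accept(record)) {
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Car> cars = Arrays.asList(new Car("Volkswagen", 2012), new Car("Toyota", 2010));
        SketchSource<Car> textLike = () -> cars; // declines pushdown, so reflection is used
        FieldValuePredicateFilterFn<Car> fn =
            new FieldValuePredicateFilterFn<>("make", FieldValuePredicateFilterFn.eq("Volkswagen"));
        System.out.println(readFiltered(textLike, fn).size()); // prints 1
    }
}
```

A source that can apply the filter itself would override tryPushDown to translate the field name and predicate into its own mechanism and return true. Note that a Predicate is opaque, so a real version would presumably need to expose the comparison (operator and value) in a form the source can translate.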
That being said, I think that this also opens the discussion of how "smart" we
want Crunch to be, or how much of this kind of optimization we want to leave
up to the user. A similar discussion is the idea of letting Crunch
automatically choose a join strategy based on observations about the data.
> Support predicate pushdown for Parquet sources
> ----------------------------------------------
>
> Key: CRUNCH-299
> URL: https://issues.apache.org/jira/browse/CRUNCH-299
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Reporter: Tom White
> Assignee: Josh Wills
>
> We should be able to push Crunch FilterFn down to a Parquet
> ColumnRecordFilter.