[
https://issues.apache.org/jira/browse/CRUNCH-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887884#comment-13887884
]
Micah Whitacre commented on CRUNCH-336:
---------------------------------------
CRUNCH-299 was logged to track something similar to "item 1", but by
interpreting a FilterFn as a RecordFilter. I'm inclined to agree with [~jwills]
and [~gabriel.reid] that we should just expose those options at source creation.
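To make that concrete, a purely hypothetical sketch of what "expose those
options at source creation" could look like; neither the builder nor the
filterClass method exists in Crunch today, and the filter class name is made up:
{code:java}
// Hypothetical API shape only -- not existing Crunch code.
PCollection<GenericRecord> people = pipeline.read(
    AvroParquetFileSource.builder(schema)        // hypothetical factory method
        .filterClass(MyRecordFilter.class)       // some UnboundRecordFilter impl
        .build(new Path("/data/people")));
{code}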
> Optimized filters and joins via Parquet RecordFilters
> -----------------------------------------------------
>
> Key: CRUNCH-336
> URL: https://issues.apache.org/jira/browse/CRUNCH-336
> Project: Crunch
> Issue Type: Improvement
> Reporter: Ryan Brush
>
> Logging this to track some ideas from an offline discussion with [~jwills]
> and [~mkwhitacre]. There's an opportunity to significantly speed up a couple
> access patterns:
> 1. Process only the subset of records in a Parquet file that is identified by
> a predicate on a single column
> 2. Perform a bloom filter join between two datasets, where the joined item is
> a Parquet column in the larger data set.
> Optimizing item 1 simply involves using a RecordFilter to narrow down the
> data loaded from the AvroParquetInputFormat.
> Optimizing item 2 is more involved. In a nutshell, we discussed doing a bloom
> filter join, but also using the bloom filter to implement a Parquet
> RecordFilter on the join column. In cases where we join on columns and only
> select a small subset of the larger dataset, this would skip the I/O and
> deserialization cost for all items that didn't match the join.
> It's not obvious to me how we'd achieve this cleanly, since it involves
> multiple pieces (configuring inputs in conjunction with a specific join
> strategy). In many cases the bloom filter join alone will achieve sufficient
> performance, but I'm logging this potential optimization for reference.
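For reference, a minimal sketch of how item 1 could be wired up today against
the pre-filter2 Parquet record filter API (parquet.filter / parquet.hadoop).
The column name, target value, and filter class are made up for illustration;
none of this is existing Crunch API:
{code:java}
import org.apache.hadoop.mapreduce.Job;

import parquet.column.ColumnReader;
import parquet.filter.ColumnPredicates;
import parquet.filter.ColumnRecordFilter;
import parquet.filter.RecordFilter;
import parquet.filter.UnboundRecordFilter;
import parquet.hadoop.ParquetInputFormat;

/** Drops records whose "event_type" column is not "purchase" (illustrative names). */
public class EventTypeFilter implements UnboundRecordFilter {

  // Delegate to Parquet's built-in single-column equality filter.
  private final UnboundRecordFilter delegate =
      ColumnRecordFilter.column("event_type", ColumnPredicates.equalTo("purchase"));

  @Override
  public RecordFilter bind(Iterable<ColumnReader> readers) {
    return delegate.bind(readers);
  }

  /** Registers the filter on the job; AvroParquetInputFormat inherits the setting. */
  public static void configure(Job job) {
    ParquetInputFormat.setUnboundRecordFilter(job, EventTypeFilter.class);
  }
}
{code}
Because ParquetInputFormat only takes a filter class name and instantiates it
reflectively, any state the filter needs has to arrive through a side channel.
That is the crux of item 2. A rough sketch of a bloom-filter-backed
RecordFilter follows; the join column name, the filter file, and the delivery
mechanism (e.g. the distributed cache) are all assumptions:
{code:java}
import java.io.DataInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

import parquet.column.ColumnReader;
import parquet.filter.ColumnPredicates;
import parquet.filter.ColumnRecordFilter;
import parquet.filter.RecordFilter;
import parquet.filter.UnboundRecordFilter;

/** Skips records whose "join_key" value is definitely absent from the smaller side. */
public class BloomJoinKeyFilter implements UnboundRecordFilter {

  // Assumed side channel: a Bloom filter built over the smaller dataset's keys,
  // shipped to each task (e.g. via the distributed cache) as "bloom_filter.bin".
  private final BloomFilter bloom = loadBloomFilter(new Path("bloom_filter.bin"));

  @Override
  public RecordFilter bind(Iterable<ColumnReader> readers) {
    return ColumnRecordFilter.column("join_key", new ColumnPredicates.Predicate() {
      @Override
      public boolean apply(ColumnReader input) {
        // Bloom filters have no false negatives, so a miss is always safe to drop.
        return bloom.membershipTest(new Key(input.getBinary().getBytes()));
      }
    }).bind(readers);
  }

  private static BloomFilter loadBloomFilter(Path path) {
    BloomFilter bloom = new BloomFilter();
    try {
      FileSystem fs = FileSystem.getLocal(new Configuration());
      try (DataInputStream in = fs.open(path)) {
        bloom.readFields(in);
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not load Bloom filter from " + path, e);
    }
    return bloom;
  }
}
{code}
The second sketch also shows why this is hard to do cleanly: the bloom filter
built by the join strategy has to be made available to the input format's
filter before the job runs.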
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)