[ 
https://issues.apache.org/jira/browse/CRUNCH-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887884#comment-13887884
 ] 

Micah Whitacre commented on CRUNCH-336:
---------------------------------------

CRUNCH-299 was logged to track something similar to "item 1", but by 
interpreting a FilterFn as a RecordFilter.  I'm inclined to agree with 
[~jwills] and [~gabriel.reid] that we should just expose those options at 
source creation.

> Optimized filters and joins via Parquet RecordFilters
> -----------------------------------------------------
>
>                 Key: CRUNCH-336
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-336
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Ryan Brush
>
> Logging this to track some ideas from an offline discussion with [~jwills] 
> and [~mkwhitacre]. There's an opportunity to significantly speed up a couple 
> of access patterns:
> 1. Process only a subset of data from a Parquet file identified by a single 
> column
> 2. Perform a bloom filter join between two datasets, where the joined item is 
> a Parquet column in the larger data set.
> Optimizing item 1 simply involves using a RecordFilter to narrow down the 
> data loaded from the AvroParquetInputFormat.
> Optimizing item 2 is more involved. In a nutshell, we discussed doing a bloom 
> filter join, but using the bloom filter to implement the Parquet RecordFilter 
> on the specific column. In cases where we join on columns and only 
> select a small subset of the larger dataset, this would skip IO and 
> deserialization cost for all items that didn't match the join.
> It's not obvious to me how we'd achieve this cleanly, since it involves 
> multiple pieces (configuring inputs in conjunction with a specific join 
> strategy). In many cases the bloom filter join alone will achieve sufficient 
> performance, but I'm logging this potential optimization for reference.
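
For reference, here is a minimal sketch of the idea in item 2: build a bloom
filter from the join keys of the smaller dataset, then use it as a predicate
on the join column of the larger one. It uses only java.util, not the Crunch
or Parquet APIs. The class BloomColumnFilter, its methods, and the String[]
row model are all hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for a column-level record filter backed by a bloom
// filter. None of these names come from Crunch or Parquet.
class BloomColumnFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomColumnFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a over the UTF-8 bytes; used as the second base hash.
    private static int fnv1a(String key) {
        int h = 0x811C9DC5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: the i-th bit index is derived from two base hashes.
    private int index(String key, int i) {
        return Math.floorMod(key.hashCode() + i * fnv1a(key), numBits);
    }

    // Record a join key taken from the smaller dataset.
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // False means the key is definitely absent; true means it may be present.
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Keep only rows of the larger dataset whose join column may match.
    // Rows are modeled as String[] purely for illustration.
    static List<String[]> filterRows(List<String[]> rows, int joinColumn,
                                     BloomColumnFilter filter) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (filter.mightContain(row[joinColumn])) {
                kept.add(row);
            }
        }
        return kept;
    }
}
```

This sketch applies the check to rows that are already in memory, so it shows
only the membership logic. The IO and deserialization savings described above
would require the reader to evaluate the predicate on the join column before
materializing the rest of the record. Because a bloom filter can return false
positives, the actual join still has to run on whatever rows survive.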



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
