Hi Reynold,

Parquet uses the same predicates that are passed to the reader (via
withFilter [1]) for both record-level and row group filtering. We've found
that the main benefit comes when the predicates can be used to eliminate
entire row groups.
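
For reference, here's a minimal sketch of what the push-down looks like
through parquet-mr's reader API, using the example GroupReadSupport and a
hypothetical int column named "x" (the file path comes from args[0]):

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.filter2.compat.FilterCompat;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.example.GroupReadSupport;

    import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;
    import static org.apache.parquet.filter2.predicate.FilterApi.lt;

    public class FilteredRead {
      public static void main(String[] args) throws Exception {
        // Build a reader with the predicate x < 0 pushed down. Parquet
        // uses it both to skip whole row groups (via stats/dictionaries)
        // and to filter records as they are assembled.
        ParquetReader<Group> reader = ParquetReader
            .builder(new GroupReadSupport(), new Path(args[0]))
            .withFilter(FilterCompat.get(lt(intColumn("x"), 0)))
            .build();

        Group record;
        while ((record = reader.read()) != null) {
          System.out.println(record);
        }
        reader.close();
      }
    }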

What bugs have you found? I haven't seen problems with the filtering done
by Parquet, so I'm surprised you've seen so many (presumably ones you've
tracked down to Parquet push-down?) that it no longer seems worth it.

Both record and row group filtering use the same predicates. Record
filtering evaluates the predicate against each assembled record, so it is
probably slower than filtering in Spark SQL. It is still faster for engines
like Pig that don't have vectorized reads and would otherwise make
additional calls on top of the Parquet layer. Also, the 2.0 spec makes it
possible to filter individual data pages, but that hasn't been implemented
yet.
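
Conceptually, record-level filtering works like the sketch below (my
illustration, not parquet-mr's actual internals); the cost comes from
materializing each record before the predicate runs:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.function.Predicate;

    public class RecordLevelFilter {
      // Sketch only: each record is fully assembled before the predicate
      // is evaluated, so the predicate cost is paid per record on
      // materialized objects.
      static <T> List<T> filterRecords(Iterator<T> assembledRecords,
                                       Predicate<T> predicate) {
        List<T> out = new ArrayList<>();
        while (assembledRecords.hasNext()) {
          T record = assembledRecords.next(); // assembly already done here
          if (predicate.test(record)) {       // test the whole record
            out.add(record);
          }
        }
        return out;
      }
    }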

In contrast to record-level filtering, row group filtering is *very*
valuable when data is correctly prepared. We have datasets where row group
filtering gets us a 20-100x speedup (measured in Pig, Presto, and Spark)
because we only need to read 1% of the data. It uses column-level stats
from the footer and dictionaries to eliminate row groups that can't satisfy
the query predicate. For example, for a column with min=5 and max=26 and a
predicate x < 0, we know that there are no matching values. Similarly, we
can look at a dictionary, which lists all of the possible values, and
eliminate a row group if none of them match the predicate. Row group
filtering works best with the data sorted within partitions by common query
columns.
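
As a sketch of those two checks (my illustration, not parquet-mr's code),
using the min=5, max=26, x < 0 example from above:

    import java.util.Set;

    public class RowGroupPruning {
      // Stats check for a predicate x < literal: if even the row group's
      // minimum is >= literal, no value can match and the whole group is
      // skipped without reading its pages. (max would be used
      // symmetrically for predicates like x > literal.)
      static boolean dropByStats(int min, int max, int literal) {
        return min >= literal; // min=5, x < 0: 5 >= 0, so drop the group
      }

      // Dictionary check: the dictionary lists every distinct value in
      // the row group, so if none satisfies x < literal, no row can match.
      static boolean dropByDictionary(Set<Integer> dictionary, int literal) {
        return dictionary.stream().noneMatch(v -> v < literal);
      }

      public static void main(String[] args) {
        System.out.println(dropByStats(5, 26, 0));                  // true
        System.out.println(dropByDictionary(Set.of(5, 12, 26), 0)); // true
      }
    }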

rb

[1]:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L190

On Wed, Jul 6, 2016 at 11:13 AM, Reynold Xin <[email protected]> wrote:

> Among the people working on Spark there is a lot of confusion about what
> Parquet's filter pushdown actually accomplishes. Depending on who I talk
> to, I get "it filters rows one by one" or "it skips blocks via min/max
> value tracking". Can I get a more official response on this?
>
> The reason I'm asking is that we have seen so many bugs related to filter
> pushdown (either bugs in Parquet, or bugs in Spark's implementation of it)
> that we are considering just permanently disabling filter pushdown, if the
> performance gain is not enormous.
>
> Let me know. Thanks.
>



-- 
Ryan Blue
Software Engineer
Netflix
