One of Spark's commonly seen ETL use cases is inferring the schema of JSON datasets
automatically and then converting them into Parquet. In use cases like this, schema
evolution support can be crucial: reading Parquet files with different but compatible
schemata is quite common. Schema evolution combined with filter push-down can be a
source of bugs; PARQUET-389 is an example of this kind of bug. To work around
PARQUET-389, we made some non-trivial changes in Spark (SPARK-11955),
which in turn led to SPARK-16371.
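For illustration, a minimal sketch of this kind of pipeline (paths are hypothetical; assumes Spark 2.x's SparkSession):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

    // Schema is inferred automatically from the JSON records.
    val events = spark.read.json("hdfs:///data/events/2016-07-06.json")

    // Repeated runs may append files with different but compatible schemata
    // as the upstream JSON evolves.
    events.write.mode("append").parquet("hdfs:///warehouse/events")

    // Reading the combined dataset back relies on Parquet schema merging.
    val merged = spark.read.option("mergeSchema", "true").parquet("hdfs:///warehouse/events")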
From the perspective of performance, I totally agree that row group
level filtering is valuable. I think the real problem here is that
record-level filtering is mandatory if the engine decides to use filter
push-down. For engines with a vectorized Parquet reader, like Spark,
Parquet's built-in record-level filtering is not performant enough.
In fact, we observed that disabling filter push-down can even result in
better performance when the data is not prepared for row group level
filtering, because the filter predicates are evaluated on the Spark side
with the help of codegen. One possible improvement here is to make
record-level filtering optional. That way, we could benefit from both
Parquet's built-in row group level filtering and the faster record-level
filtering provided by upper-level engines. Of course, when record-level
filtering is disabled, the engines themselves are responsible for doing
the filtering.
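To illustrate the trade-off (hypothetical path, assuming the SparkSession sketched above), push-down can be turned off per session, leaving the predicate entirely to Spark's codegen:

    import org.apache.spark.sql.functions.col

    // With push-down disabled, no predicate reaches parquet-mr; Spark's
    // generated code evaluates it against the decoded data instead.
    spark.conf.set("spark.sql.parquet.filterPushdown", "false")

    val df = spark.read.parquet("hdfs:///warehouse/events")
    df.filter(col("x") < 0).count()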
Cheng
On 7/7/16 2:43 AM, Ryan Blue wrote:
Hi Reynold,
Parquet uses the same predicates that are passed to the reader (via
withFilter [1]) for both record-level and row group filtering. We've
found that the main benefit is when they can be used to eliminate
entire row groups.
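Roughly, a predicate built with FilterApi is handed to the reader via withFilter, and that same object drives both levels of filtering. A Scala sketch against the example GroupReadSupport (the path is made up):

    import org.apache.hadoop.fs.Path
    import org.apache.parquet.filter2.compat.FilterCompat
    import org.apache.parquet.filter2.predicate.FilterApi
    import org.apache.parquet.hadoop.ParquetReader
    import org.apache.parquet.hadoop.example.GroupReadSupport

    // Predicate on an INT32 column "x": used to drop whole row groups via
    // stats/dictionaries and to filter individual records during assembly.
    val predicate = FilterApi.lt(FilterApi.intColumn("x"), java.lang.Integer.valueOf(0))

    val reader = ParquetReader
      .builder(new GroupReadSupport(), new Path("hdfs:///warehouse/events/part-00000.parquet"))
      .withFilter(FilterCompat.get(predicate))
      .build()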
What bugs have you found? I've not seen problems with the filtering
done by Parquet so I'm surprised that you guys have seen so many
(presumably that you've tracked to Parquet push-down?) that it doesn't
seem worth it.
Both record and row group filtering use the same predicates. Record
filtering evaluates a predicate using an assembled record, so it is
probably slower than filtering in Spark SQL. This is faster for
engines like Pig that don't have vectorized reads and would have
additional calls on top of the Parquet layer. Also, the 2.0 spec makes
it possible to filter individual data pages, but this hasn't been
implemented.
In contrast to record-level, row group filtering is *very* valuable
when data is correctly prepared. We have datasets where row group
filtering gets us a 20-100x speedup (measured in Pig, Presto, and
Spark) because we only need to read 1% of the data. This uses
column-level stats from the footer and dictionaries to eliminate row
groups that can't satisfy the query predicate. For example, for a
column with min=5, max=26 and a predicate x < 0, we know that there
are no matching values. Similarly, we can look at a dictionary and see
all of the possible values and eliminate a row group if none of them
match the predicate. Row group filtering works best with the data
sorted within partitions by common query columns.
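A rough sketch of that preparation step in Spark (column and path names are illustrative): clustering rows by the commonly queried column before writing keeps each row group's min/max range narrow enough to be skipped.

    import org.apache.spark.sql.functions.col

    // Cluster rows by the query column so each row group covers a narrow
    // min/max range and can be eliminated from the footer stats alone.
    spark.read.parquet("hdfs:///warehouse/events")
      .repartition(col("x"))
      .sortWithinPartitions(col("x"))
      .write
      .mode("overwrite")
      .parquet("hdfs:///warehouse/events_sorted")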
rb
[1]:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L190
On Wed, Jul 6, 2016 at 11:13 AM, Reynold Xin <[email protected]> wrote:
Among the people working on Spark there is a lot of confusion about what
Parquet's filter pushdown actually accomplishes. Depending on who I talk
to, I get "it filters rows one by one" or "it skips blocks via min/max
value tracking". Can I get a more official response on this?
The reason I'm asking is that we have seen so many bugs related to filter
pushdown (either bugs in Parquet, or bugs in Spark's implementation of it)
that we are considering just permanently disabling filter pushdown, if the
performance gain is not enormous.
Let me know. Thanks.
--
Ryan Blue
Software Engineer
Netflix