Hey Hyukjin,
Sorry that I missed the JIRA ticket. Thanks for bringing this issue up
here, and for your detailed investigation.
From my side, I think this is a bug of Parquet. Parquet was designed to
support schema evolution. When scanning a Parquet file, if a column exists in
the requested schema but missing in the file schema, that column is
filled with null. This should also hold for pushed-down predicate
filters. For example, if filter "a = 1" is pushed down but column "a"
doesn't exist in the Parquet file being scanned, it's safe to assume "a"
is null in all records and drop all of them. On the contrary, if "a IS
NULL" is pushed down, all records should be preserved.
Apparently, until this issue is properly fixed on the Parquet side, we need
to work around it on the Spark side. Please see my comments on all
3 of your solutions inlined below. In short, I'd like to have approach 1
for branch-1.5 and approach 2 for master.
Cheng
On 10/28/15 10:11 AM, Hyukjin Kwon wrote:
When enabling mergeSchema and predicate filter push-down, this fails
since Parquet filters are pushed down regardless of the schema of each
split (or rather, each file).
Dominic Ricard reported this
issue (https://issues.apache.org/jira/browse/SPARK-11103)
Even though this works okay when spark.sql.parquet.filterPushdown is set
to false, the default value of this option is true. So this looks like an
issue.
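For reference, a rough reproduction along these lines (paths and column
names are made up; part=1 has columns a and b, part=2 only has b):

    sqlContext.range(0, 10).selectExpr("id AS a", "id AS b")
      .write.parquet("/tmp/table/part=1")
    sqlContext.range(0, 10).selectExpr("id AS b")
      .write.parquet("/tmp/table/part=2")

    // With mergeSchema and filterPushdown both enabled, "a = 1" is pushed
    // down to every file, including part=2 which has no column a, and the
    // scan fails instead of treating a as null.
    sqlContext.read.option("mergeSchema", "true").parquet("/tmp/table")
      .filter("a = 1")
      .count()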
My questions are:
is this clearly an issue?
And if so, which way should this be handled?
I thought this was an issue, so I made three rough patches for it and
tested them; they all look fine.
The first approach looks simpler and more appropriate, judging from
previous fixes such as
https://issues.apache.org/jira/browse/SPARK-11153
However, in terms of safety and performance, I also want to confirm
which one would be the proper approach before trying to open a PR.
1. Simply set spark.sql.parquet.filterPushdown to false when using
mergeSchema
This one is pretty simple and safe. I'd like to have this for 1.5.2, or
1.5.3 if we can't make it for 1.5.2.
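For completeness, the user-side equivalent of this workaround (what the fix
would effectively do automatically) looks roughly like this:

    // Disable Parquet filter push-down before reading merged-schema data.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

    val df = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/tmp/table")   // illustrative path

    // Filters are now evaluated by Spark instead of being pushed to Parquet.
    df.filter("a = 1").count()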
2. If spark.sql.parquet.filterPushdown is true, retrieve the schemas of
every part-file (and also the merged one), check whether each of them can
accept the given filter, and then apply the filter only when they all can
accept it. I think this is a bit over-engineered, though.
Actually we only need to calculate the intersection of all file
schemata. We can make ParquetRelation.mergeSchemaInParallel return two
StructTypes: the first is the original merged schema, and the other is
the intersection of all file schemata, which only contains fields that
exist in all file schemata. Then we decide which filters to push down
according to the second StructType.
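Roughly something like the following sketch (the wiring into
ParquetRelation is omitted, referencedColumns only covers a few Filter
cases, and fields are matched by name only for brevity):

    import org.apache.spark.sql.sources._
    import org.apache.spark.sql.types.{StructField, StructType}

    // Intersection of all file schemata: keep only fields present in every file.
    def intersectSchemas(fileSchemas: Seq[StructType]): StructType =
      fileSchemas.reduce { (a, b) =>
        StructType(a.filter(f => b.fieldNames.contains(f.name)))
      }

    // Columns referenced by a pushed-down data source filter.
    def referencedColumns(filter: Filter): Seq[String] = filter match {
      case EqualTo(attr, _)     => Seq(attr)
      case GreaterThan(attr, _) => Seq(attr)
      case LessThan(attr, _)    => Seq(attr)
      case IsNull(attr)         => Seq(attr)
      case IsNotNull(attr)      => Seq(attr)
      case And(l, r)            => referencedColumns(l) ++ referencedColumns(r)
      case Or(l, r)             => referencedColumns(l) ++ referencedColumns(r)
      case Not(child)           => referencedColumns(child)
      case _                    => Seq.empty
    }

    // Push a filter down only if every column it touches exists in all files.
    def safeToPushDown(filter: Filter, intersected: StructType): Boolean =
      referencedColumns(filter).forall(c => intersected.fieldNames.contains(c))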
3. If spark.sql.parquet.filterPushdown is true, retrieve the schemas of
every part-file (and also the merged one) and apply the filter only to each
split (or rather, file) that can accept it, which (I think this is hacky)
ends up with different configurations for each task in a job.
The idea I came up with at first was similar to this one. Instead of
pulling all file schemata to the driver side, we can move filter push-down
to the executor side. Namely, pass the candidate filters to the executors
and compute the Parquet predicate filter according to each file schema.
I haven't looked into this direction in depth, but we can probably put
this part into CatalystReadSupport, which is now initialized on executor
side.
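A rough sketch of that executor-side step (names here are illustrative
rather than existing Spark APIs; referencedColumns is the helper from the
sketch above):

    import scala.collection.JavaConverters._
    import org.apache.parquet.schema.MessageType
    import org.apache.spark.sql.sources.Filter

    // Given the footer schema of the file this task is about to read and the
    // candidate filters shipped from the driver, keep only the filters whose
    // columns actually exist in this file; only the survivors would then be
    // converted into a Parquet FilterPredicate for this split.
    def filtersForThisFile(fileSchema: MessageType, candidates: Seq[Filter]): Seq[Filter] = {
      val columnsInFile = fileSchema.getFields.asScala.map(_.getName).toSet
      candidates.filter(f => referencedColumns(f).forall(columnsInFile))
    }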
However, the correctness of this approach can only be guaranteed by the
defensive filtering we do in Spark SQL (i.e., applying all the filters no
matter whether they are pushed down or not), but we are considering removing
it because it imposes unnecessary performance cost. This makes me hesitant
to go along this way.