Github user mallman commented on the issue:
https://github.com/apache/spark/pull/22357
I have reconstructed my original patch for this issue, but it will require
more work to complete. As part of that reconstruction, however, I've found a
couple of cases where our patches create different physical plans. The query
results are the same, but I'm not sure which plan, if either, is correct. I
want to go into detail on that, but it's complicated and I have to call it
quits tonight. I have a flight in the morning, and I'll be on break next week.
In the meantime, I'll just copy and paste two queries (based on the data
in `ParquetSchemaPruningSuite.scala`) with two query plans each.
First query:
select employer.id from contacts where employer is not null
This PR (as of d68f808) produces:
```
== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct<employer:struct<id:int>>
```
My WIP patch produces:
```
== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct<employer:struct<id:int,company:struct<name:string,address:string>>>
```
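To make the `ReadSchema` difference concrete, here is a minimal Python sketch of nested-schema pruning. It is a simplified model, not Spark's implementation: the `prune` and `render` helpers and the dict-based schema are hypothetical, and only the `employer` column visible in the plans above is modelled. It shows that the two schemas correspond to two different sets of required fields, without saying which set is the right one.

```python
# Hypothetical, simplified model of nested-schema pruning (not Spark code).
# A schema is a dict: leaf values are type names, nested dicts are structs.
CONTACTS = {
    "employer": {
        "id": "int",
        "company": {"name": "string", "address": "string"},
    }
}


def prune(schema, paths):
    """Keep only the fields reachable through the given dotted paths.

    A path that stops at a struct (e.g. "employer") keeps that struct whole.
    """
    # Group the remaining path segments by their first component.
    wanted = {}
    for path in paths:
        head, _, rest = path.partition(".")
        wanted.setdefault(head, []).append(rest)

    pruned = {}
    for name, field in schema.items():
        if name not in wanted:
            continue
        rests = wanted[name]
        if isinstance(field, dict) and all(rests):
            # Every reference goes deeper into the struct, so recurse.
            pruned[name] = prune(field, rests)
        else:
            # A leaf, or the struct itself is referenced: keep it unpruned.
            pruned[name] = field
    return pruned


def render(schema):
    """Format a schema dict in the struct<...> notation used by ReadSchema."""
    inner = ",".join(
        f"{k}:{render(v) if isinstance(v, dict) else v}" for k, v in schema.items()
    )
    return f"struct<{inner}>"


# Only the projected field counts as required.
print(render(prune(CONTACTS, ["employer.id"])))
# The filter's reference to the whole `employer` struct also counts.
print(render(prune(CONTACTS, ["employer.id", "employer"])))
```

Under these assumptions the first call yields the `ReadSchema` from this PR's plan and the second yields the one from my WIP plan, so the open question is whether `isnotnull(employer)` should count as a reference to the whole struct.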
Second query:
select employer.id from contacts where employer.id = 0
This PR produces:
```
== Physical Plan ==
*(1) Project [employer#4297.id AS id#4308]
+- *(1) Filter (isnotnull(employer#4297) && (employer#4297.id = 0))
   +- *(1) FileScan parquet [employer#4297,p#4298] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct<employer:struct<id:int>>
```
My WIP patch produces:
```
== Physical Plan ==
*(1) Project [employer#4445.id AS id#4456]
+- *(1) Filter (isnotnull(employer#4445.id) && (employer#4445.id = 0))
   +- *(1) FileScan parquet [employer#4445,p#4446] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<employer:struct<id:int>>
```
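Since the two plans for the second query differ only in small details, here is a rough Python sketch that pulls out the fields worth comparing. The `scan_fields` and `diff_plans` helpers are hypothetical and the regexes only handle plan text shaped like the output above; expression IDs such as `#4297` are stripped because they differ between runs.

```python
import re

# The two plans for the second query, copied from above (line wrapping is irrelevant here).
PR_PLAN = """
*(1) Project [employer#4297.id AS id#4308]
+- *(1) Filter (isnotnull(employer#4297) && (employer#4297.id = 0))
+- *(1) FileScan parquet [employer#4297,p#4298] Batched: false, Format: Parquet,
PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
ReadSchema: struct<employer:struct<id:int>>
"""

WIP_PLAN = """
*(1) Project [employer#4445.id AS id#4456]
+- *(1) Filter (isnotnull(employer#4445.id) && (employer#4445.id = 0))
+- *(1) FileScan parquet [employer#4445,p#4446] Batched: false, Format: Parquet,
PartitionCount: 2, PartitionFilters: [], PushedFilters: [],
ReadSchema: struct<employer:struct<id:int>>
"""


def scan_fields(plan):
    """Extract the Filter condition, PushedFilters and ReadSchema from a plan string."""
    flat = " ".join(plan.split())        # undo line wrapping
    flat = re.sub(r"#\d+", "", flat)     # drop run-specific expression IDs
    return {
        "Filter": re.search(r"Filter (.*?) \+- ", flat).group(1),
        "PushedFilters": re.search(r"PushedFilters: (\[.*?\])", flat).group(1),
        "ReadSchema": re.search(r"ReadSchema: (\S+)", flat).group(1),
    }


def diff_plans(left, right):
    """Return only the fields whose values differ, as (left, right) pairs."""
    a, b = scan_fields(left), scan_fields(right)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


for field, (pr_value, wip_value) in diff_plans(PR_PLAN, WIP_PLAN).items():
    print(f"{field}:\n  this PR: {pr_value}\n  WIP:     {wip_value}")
```

For these two plans the sketch reports a difference in `Filter` (`isnotnull(employer)` versus `isnotnull(employer.id)`) and in `PushedFilters` (`[IsNotNull(employer)]` versus `[]`), while `ReadSchema` is identical.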
I wanted to give my thoughts on the differences between these plans in detail,
but I have to wrap up my work for the night. I'll be visiting family next week.
I don't know how responsive I'll be during that time, but I'll at least try to
check back.
Cheers.