[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...

mallman Fri, 07 Sep 2018 21:43:24 -0700

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/22357
  
    I have reconstructed my original patch for this issue, but I've discovered 
it will require more work to complete. However, as part of that reconstruction 
I've discovered a couple of cases where our patches create different physical 
plans. The query results are the same, but I'm not sure which plan, if 
either, is correct. I want to go into detail on that, but it's 
complicated and I have to call it quits tonight. I have a flight in the 
morning, and I'll be on break next week.
    
    In the meantime, I'll just copy and paste two queries, based on the data 
in `ParquetSchemaPruningSuite.scala`, with two query plans each.
    
    First query:
    
        select employer.id from contacts where employer is not null
    
    This PR (as of d68f808) produces:
    
    ```
    == Physical Plan ==
    *(1) Project [employer#4442.id AS id#4452]
    +- *(1) Filter isnotnull(employer#4442)
       +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
        ReadSchema: struct<employer:struct<id:int>>
    ```
    
    My WIP patch produces:
    
    ```
    == Physical Plan ==
    *(1) Project [employer#4442.id AS id#4452]
    +- *(1) Filter isnotnull(employer#4442)
       +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
        ReadSchema: struct<employer:struct<id:int,company:struct<name:string,address:string>>>
    ```
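
    A minimal sketch of one possible reading of the two `ReadSchema` values above, not a description of how either patch works: the narrower schema is what you get if only `employer.id` counts as a required field, and the wider one is what you get if the bare `employer` reference in the filter also counts and keeps the whole struct. The `prune` and `as_struct` helpers are hypothetical, and the schema is copied from the wider `ReadSchema` above on the assumption that it is the full struct.

```python
import copy

# Assumed full schema of the `employer` column, copied from the wider ReadSchema.
FULL_SCHEMA = {
    "employer": {
        "id": "int",
        "company": {"name": "string", "address": "string"},
    }
}


def _keep(schema, pruned, path):
    """Copy the field addressed by `path` from `schema` into `pruned`."""
    head, rest = path[0], path[1:]
    node = schema[head]
    if not rest or not isinstance(node, dict):
        # The path stops here: keep the field whole, including all subfields.
        pruned[head] = copy.deepcopy(node)
        return
    _keep(node, pruned.setdefault(head, {}), rest)


def prune(schema, paths):
    """Hypothetical pruning: keep only what the requested field paths reach."""
    pruned = {}
    for path in paths:
        _keep(schema, pruned, path)
    return pruned


def as_struct(schema):
    """Render a schema dict in the struct<...> notation used in the plans."""
    fields = ",".join(
        f"{name}:{as_struct(t) if isinstance(t, dict) else t}"
        for name, t in schema.items()
    )
    return f"struct<{fields}>"


# Only the projected leaf is treated as required.
narrow = prune(FULL_SCHEMA, [("employer", "id")])
# The filter's reference to `employer` itself is also treated as required.
wide = prune(FULL_SCHEMA, [("employer", "id"), ("employer",)])

print(as_struct(narrow))
print(as_struct(wide))
```

    Under these assumptions the two printed strings match the two `ReadSchema` values above. Which behaviour is right for `employer is not null` is exactly the open question.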
    
    Second query:
    
        select employer.id from contacts where employer.id = 0
    
    This PR produces:
    
    ```
    == Physical Plan ==
    *(1) Project [employer#4297.id AS id#4308]
    +- *(1) Filter (isnotnull(employer#4297) && (employer#4297.id = 0))
       +- *(1) FileScan parquet [employer#4297,p#4298] Batched: false, Format: Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)],
        ReadSchema: struct<employer:struct<id:int>>
    ```
    
    My WIP patch produces:
    
    ```
    == Physical Plan ==
    *(1) Project [employer#4445.id AS id#4456]
    +- *(1) Filter (isnotnull(employer#4445.id) && (employer#4445.id = 0))
       +- *(1) FileScan parquet [employer#4445,p#4446] Batched: false, Format: Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: [],
        ReadSchema: struct<employer:struct<id:int>>
    ```
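
    A small sketch of why the two `Filter` conditions above can keep the same rows, using SQL three-valued logic with Python's `None` standing in for NULL. The sample rows are made up for illustration and are not the data in `ParquetSchemaPruningSuite.scala`. The sketch covers only the `Filter` node; it says nothing about what the differing `PushedFilters` lists do at the Parquet reader.

```python
def sql_eq(a, b):
    """SQL equality: NULL if either side is NULL."""
    return None if a is None or b is None else a == b


def sql_and(a, b):
    """SQL AND under three-valued logic."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def get_field(struct, name):
    """Field access on a struct; a NULL struct yields NULL."""
    return None if struct is None else struct.get(name)


def filter_this_pr(employer):
    # isnotnull(employer) && (employer.id = 0)
    return sql_and(employer is not None, sql_eq(get_field(employer, "id"), 0))


def filter_wip(employer):
    # isnotnull(employer.id) && (employer.id = 0)
    emp_id = get_field(employer, "id")
    return sql_and(emp_id is not None, sql_eq(emp_id, 0))


# Hypothetical `employer` values: NULL struct, NULL id, matching id, other id.
rows = [None, {"id": None}, {"id": 0}, {"id": 1}]

# A filter keeps a row only when the condition is TRUE (not FALSE, not NULL).
kept_this_pr = [r for r in rows if filter_this_pr(r) is True]
kept_wip = [r for r in rows if filter_wip(r) is True]

print(kept_this_pr == kept_wip)
```

    For a NULL `employer` or a NULL `employer.id`, the comparison `employer.id = 0` is itself NULL, so the row is dropped under either condition.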
    
    I wanted to give my thoughts on the differences of these in detail, but I 
have to wrap up my work for the night. I'll be visiting family next week. I 
don't know how responsive I'll be in that time, but I'll at least try to check 
back.
    
    Cheers.


