Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22357#discussion_r216545091
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
    @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
         val projectionRootFields = projects.flatMap(getRootFields)
         val filterRootFields = filters.flatMap(getRootFields)
     
    -    (projectionRootFields ++ filterRootFields).distinct
    +    // Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
    +    // For them, if there are any nested fields accessed in the query, we 
don't need to add root
    +    // field access of above expressions.
    +    // For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
    +    // we don't need to read nested fields of `name` struct other than 
`first` field.
    --- End diff --
    
    I'm having trouble accepting this, but perhaps I'm reading too much into it 
(or not enough). Let me illustrate with a couple of queries and their physical 
plans.
    
    Assuming the data model in `ParquetSchemaPruningSuite.scala`, the physical 
plan for the query
    
        select employer.id from contacts where employer is not null
    
    is
    
    ```
    == Physical Plan ==
    *(1) Project [employer#36.id AS id#46]
    +- *(1) Filter isnotnull(employer#36)
       +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: 
Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: 
[IsNotNull(employer)],
        ReadSchema: struct<employer:struct<id:int>>
    ```
    
    The physical plan for the query
    
        select employer.id from contacts where employer.id is not null
    
    is
    
    ```
    == Physical Plan ==
    *(1) Project [employer#36.id AS id#47]
    +- *(1) Filter (isnotnull(employer#36) && isnotnull(employer#36.id))
       +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: 
Parquet,
        PartitionCount: 2, PartitionFilters: [], PushedFilters: 
[IsNotNull(employer)],
        ReadSchema: struct<employer:struct<id:int>>
    ```
    
    The read schemata are the same, but the query filters are not. The file 
scan for the second query looks as I would expect, but the scan for the first 
query appears to only read `employer.id` even though it needs to check 
`employer is not null`. If it only reads `employer.id`, how does it check that 
`employer.company` is not null? Perhaps `employer.id` is null but 
`employer.company` is not null for some row...
    
    I have run some tests to validate that this PR is returning the correct 
results for both queries, and it is. But I don't understand why.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to