Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22357#discussion_r216565635
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
    @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
         val projectionRootFields = projects.flatMap(getRootFields)
         val filterRootFields = filters.flatMap(getRootFields)
     
    -    (projectionRootFields ++ filterRootFields).distinct
    +    // Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
    +    // For them, if there are any nested fields accessed in the query, we 
don't need to add root
    +    // field access of above expressions.
    +    // For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
    +    // we don't need to read nested fields of `name` struct other than 
`first` field.
    --- End diff --
    
    > I checked in the ParquetFilter, IsNotNull(employer) will be ignored since 
it's not a valid parquet filter as parquet doesn't support pushdown on the 
struct; thus, with this PR, this query will return wrong answer.
    
    We may not worry about wrong answer from datasource like Parquet in 
predicate pushdown. As not all predicates are supported by pushdown, we always 
have a SparkSQL Filter on top of scan node to make sure to receive correct 
answer.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to