Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/22357#discussion_r216565635
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends
Rule[LogicalPlan] {
val projectionRootFields = projects.flatMap(getRootFields)
val filterRootFields = filters.flatMap(getRootFields)
- (projectionRootFields ++ filterRootFields).distinct
+ // Kind of expressions don't need to access any fields of a root
fields, e.g., `IsNotNull`.
+ // For them, if there are any nested fields accessed in the query, we
don't need to add root
+ // field access of above expressions.
+ // For example, for a query `SELECT name.first FROM contacts WHERE
name IS NOT NULL`,
+ // we don't need to read nested fields of `name` struct other than
`first` field.
--- End diff --
> I checked in the ParquetFilter, IsNotNull(employer) will be ignored since
it's not a valid parquet filter as parquet doesn't support pushdown on the
struct; thus, with this PR, this query will return wrong answer.
We may not worry about wrong answer from datasource like Parquet in
predicate pushdown. As not all predicates are supported by pushdown, we always
have a SparkSQL Filter on top of scan node to make sure to receive correct
answer.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]