Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22009#discussion_r208808091
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -76,41 +76,43 @@ object DataSourceV2Strategy extends Strategy {
   /**
    * Applies column pruning to the data source, w.r.t. the references of the given expressions.
    *
-   * @return new output attributes after column pruning.
+   * @return the created `ScanConfig`(since column pruning is the last step of operator pushdown),
+   *         and new output attributes after column pruning.
--- End diff --
Technically, column pruning is not operator pushdown. It's more of a property of the SQL operator, indicating which columns/nested fields the parent operator needs.
Column pruning is orthogonal to operator pushdown: when operator pushdown is finished, Spark checks the remaining query plan on the Spark side (assuming some predicates were not pushed), calculates the columns required by the scan node, and passes them to the data source as a hint.
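To make that flow concrete, a rough sketch of the required-column calculation could look like this (names and signatures here are illustrative only, not the code in this PR):
```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: given the scan's current output and the expressions that
// remain on the Spark side after pushdown, compute the pruned schema that is
// passed to the data source as a hint.
def requiredSchema(
    output: Seq[AttributeReference],
    remainingExprs: Seq[Expression]): StructType = {
  // Collect every column the un-pushed operators still reference.
  val referenced = AttributeSet(remainingExprs.flatMap(_.references))
  // Keep only the referenced columns from the scan's output.
  StructType.fromAttributes(output.filter(referenced.contains))
}
```
The point is that this hint can be computed purely from the plan that remains after pushdown, regardless of which operators were actually pushed.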
I'm fine with the current API, but we should have a better API for column
pruning in the future.
BTW, for now I think adding `prunedSchema` to `SupportsPushDownRequiredColumns` is the better option: data sources that don't support operator pushdown then don't need to implement `ScanConfig.readSchema`. But I don't have a strong opinion.
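A rough sketch of that alternative (illustrative only; the real DSv2 mix-ins are Java interfaces, and this is not the actual definition):
```scala
import org.apache.spark.sql.types.StructType

// Sketch of the suggested shape, in Scala for brevity.
trait SupportsPushDownRequiredColumns {
  // Spark passes the columns required by the remaining plan as a hint.
  def pruneColumns(requiredSchema: StructType): Unit

  // The schema the source settled on after pruning; with this, a source that
  // supports no other pushdown would not need to implement ScanConfig.readSchema.
  def prunedSchema(): StructType
}
```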
---