Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22009#discussion_r208808091
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala ---
@@ -76,41 +76,43 @@ object DataSourceV2Strategy extends Strategy {
   /**
    * Applies column pruning to the data source, w.r.t. the references of the given expressions.
    *
-   * @return new output attributes after column pruning.
+   * @return the created `ScanConfig`(since column pruning is the last step of operator pushdown),
+   *         and new output attributes after column pruning.
--- End diff --
Technically, column pruning is not operator pushdown. It's more of a property of the SQL operator, indicating which columns/nested fields the parent operator needs.
Column pruning is orthogonal to operator pushdown: when operator pushdown is finished, Spark checks the remaining query plan on the Spark side (assuming some predicates were not pushed), calculates the columns required by the scan node, and passes them to the data source as a hint.
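To make that flow concrete, a rough sketch of the required-column calculation could look like this (names and signatures here are illustrative only, not the code in this PR):
```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: given the scan's current output and the expressions that
// remain on the Spark side after pushdown, compute the pruned schema that is
// passed to the data source as a hint.
def requiredSchema(
    output: Seq[AttributeReference],
    remainingExprs: Seq[Expression]): StructType = {
  // Collect every column the un-pushed operators still reference.
  val referenced = AttributeSet(remainingExprs.flatMap(_.references))
  // Keep only the referenced columns from the scan's output.
  StructType.fromAttributes(output.filter(referenced.contains))
}
```
The point is that this hint can be computed purely from the plan that remains after pushdown, regardless of which operators were actually pushed.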
I'm fine with the current API, but we should have a better API for column
pruning in the future.
BTW, for now I think adding `prunedSchema` to `SupportsPushDownRequiredColumns` is the better option: data sources that don't support operator pushdown then don't need to implement `ScanConfig.readSchema`. But I don't have a strong opinion.
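A rough sketch of that alternative (illustrative only; the real DSv2 mix-ins are Java interfaces, and this is not the actual definition):
```scala
import org.apache.spark.sql.types.StructType

// Sketch of the suggested shape, in Scala for brevity.
trait SupportsPushDownRequiredColumns {
  // Spark passes the columns required by the remaining plan as a hint.
  def pruneColumns(requiredSchema: StructType): Unit

  // The schema the source settled on after pruning; with this, a source that
  // supports no other pushdown would not need to implement ScanConfig.readSchema.
  def prunedSchema(): StructType
}
```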
---