Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22009#discussion_r208347697
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownRequiredColumns.java ---
    @@ -21,22 +21,25 @@
     import org.apache.spark.sql.types.StructType;
     
     /**
    - * A mix-in interface for {@link DataSourceReader}. Data source readers can implement this
    + * A mix-in interface for {@link ScanConfigBuilder}. Data sources can implement this
      * interface to push down required columns to the data source and only read these columns during
      * scan to reduce the size of the data to be read.
      */
     @InterfaceStability.Evolving
    -public interface SupportsPushDownRequiredColumns extends DataSourceReader {
    +public interface SupportsPushDownRequiredColumns extends ScanConfigBuilder {
     
       /**
        * Applies column pruning w.r.t. the given requiredSchema.
        *
       * Implementation should try its best to prune the unnecessary columns or nested fields, but it's
       * also OK to do the pruning partially, e.g., a data source may not be able to prune nested
       * fields, and only prune top-level columns.
    -   *
    -   * Note that, data source readers should update {@link DataSourceReader#readSchema()} after
    -   * applying column pruning.
        */
       void pruneColumns(StructType requiredSchema);
    +
    +  /**
    +   * Returns the schema after the column pruning is applied, so that Spark can know if some
    +   * columns/nested fields are not pruned.
    +   */
    +  StructType prunedSchema();
    --- End diff ---
    
    I don't see a reason to add this. Why not get the final schema from the `ScanConfig`? Getting the schema from the `ScanConfig` is better because it is clear when the pruned schema will be accessed: after all pushdown methods are called.
    
    That matters because pushed filters may cause the source to read columns beyond the pruned schema, and the source may choose to return those columns to Spark rather than adding its own projection to remove them. Deferring that projection to Spark is more efficient when Spark was going to add one anyway.
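
    To make that concrete, here is a rough sketch of the planner-side flow I have in mind. The `finalSchema` helper is hypothetical (not actual Spark code), and it assumes `ScanConfig` exposes the proposed `readSchema()` accessor:

        import org.apache.spark.sql.sources.Filter;
        import org.apache.spark.sql.sources.v2.reader.*;
        import org.apache.spark.sql.types.StructType;

        class PushdownSketch {
          // Hypothetical driver: run every pushdown method on the builder
          // first, and only then materialize the ScanConfig.
          static StructType finalSchema(
              ScanConfigBuilder builder, Filter[] filters, StructType requiredSchema) {
            if (builder instanceof SupportsPushDownFilters) {
              // Return value (the post-scan filters) is ignored in this sketch.
              ((SupportsPushDownFilters) builder).pushFilters(filters);
            }
            if (builder instanceof SupportsPushDownRequiredColumns) {
              ((SupportsPushDownRequiredColumns) builder).pruneColumns(requiredSchema);
            }

            // The schema observed here reflects all pushdown, including any
            // extra columns the source kept because pushed filters need them.
            ScanConfig config = builder.build();
            return config.readSchema();  // assumed accessor on ScanConfig
          }
        }

    With that, a source that returns filter columns instead of projecting them away simply reports the wider schema from the config, and Spark can fold the cleanup into the projection it would add anyway.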

