Github user jose-torres commented on a diff in the pull request:
https://github.com/apache/spark/pull/22009#discussion_r208987767
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java
---
@@ -22,18 +22,16 @@
import org.apache.spark.annotation.InterfaceStability;
/**
- * An input partition returned by {@link DataSourceReader#planInputPartitions()} and is
- * responsible for creating the actual data reader of one RDD partition.
- * The relationship between {@link InputPartition} and {@link InputPartitionReader}
- * is similar to the relationship between {@link Iterable} and {@link java.util.Iterator}.
+ * An input partition returned by {@link ReadSupport#planInputPartitions(ScanConfig)}, which
+ * represents a data split that should be processed by one Spark task.
--- End diff ---
I'm not sure we need to talk about "data split" - I don't think people will
try to implement data sources without knowing what a partition is in Spark.
I'd suggest saying "A serializable representation of an input
partition...", to make it clear that this should just contain metadata required
to identify what the partition is and not the actual data.
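To illustrate the suggested reading of "a serializable representation of an input partition", here is a hypothetical sketch (not from the Spark codebase; the class name and fields are invented): a partition object carries only the lightweight metadata a task needs to locate its data, such as a file path and byte range, never the rows themselves.

```java
import java.io.Serializable;

// Hypothetical example: an input partition as pure metadata.
// It identifies *where* the data lives (path + byte range); the actual
// reading happens later, on the executor, using this metadata.
class FileSlicePartition implements Serializable {
    private final String path;   // location of the underlying file
    private final long start;    // starting byte offset of this slice
    private final long length;   // number of bytes in this slice

    FileSlicePartition(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    String path() { return path; }
    long start() { return start; }
    long length() { return length; }

    public static void main(String[] args) {
        // The object is cheap to serialize and ship to a task: it holds
        // only identifying metadata, no row data.
        FileSlicePartition p = new FileSlicePartition("/data/part-00000", 0L, 1024L);
        System.out.println(p.path() + " [" + p.start() + ", " + p.length() + ")");
    }
}
```

Because such an object is serialized from the driver to executors, keeping it to identifying metadata (rather than buffered data) keeps task scheduling cheap.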
---