Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21948#discussion_r207294283
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java
---
@@ -50,4 +50,15 @@
   * this ID will always be 0.
   */
  DataWriter<T> createDataWriter(int partitionId, long taskId, long epochId);
+
+  /**
+   * When true, Spark will reuse the same data object instance when sending data to the
+   * data writer, for better performance. Data writers should handle reused data objects
+   * carefully, e.g. do not buffer them in a list. By default this returns false for
+   * safety; data sources can override it if their data writers immediately write the
+   * data object somewhere else, such as a memory buffer or disk.
+   */
+ default boolean reuseDataObject() {
--- End diff --
I don't think this should be added in this commit. This PR is to move to
`InternalRow` and should not alter the API. I'm fine documenting this, but
writers are responsible for making defensive copies if necessary. This default
is going to make sources slower, and I don't think it is necessary for
implementations other than tests that buffer data in memory.
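To illustrate the hazard being discussed: when a data source reuses one mutable row instance across calls, a writer that buffers the reference (instead of a defensive copy) ends up with every buffered entry reflecting the last write. The sketch below is self-contained and hypothetical; `MutableRow` stands in for Spark's reused `InternalRow` and is not a real Spark class.

```java
import java.util.ArrayList;
import java.util.List;

public class DefensiveCopyDemo {
    // Hypothetical stand-in for a reused row object like Spark's InternalRow.
    static final class MutableRow {
        int value;
        MutableRow copy() {
            MutableRow c = new MutableRow();
            c.value = this.value;
            return c;
        }
    }

    public static void main(String[] args) {
        MutableRow reused = new MutableRow();
        List<MutableRow> unsafeBuffer = new ArrayList<>();
        List<MutableRow> safeBuffer = new ArrayList<>();

        for (int i = 1; i <= 3; i++) {
            reused.value = i;               // the source reuses one instance
            unsafeBuffer.add(reused);       // buggy: buffers the shared reference
            safeBuffer.add(reused.copy());  // defensive copy preserves each value
        }

        // All entries in the unsafe buffer alias one object holding the last value.
        System.out.println(unsafeBuffer.get(0).value); // prints 3
        System.out.println(safeBuffer.get(0).value);   // prints 1
    }
}
```

A writer that immediately serializes each row to a memory buffer or disk never holds the reference past the call, which is why such writers can tolerate reuse without copying.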
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]