Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21948#discussion_r207294283
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java
---
@@ -50,4 +50,15 @@
   * this ID will always be 0.
   */
  DataWriter<T> createDataWriter(int partitionId, long taskId, long epochId);
+
+  /**
+   * When true, Spark will reuse the same data object instance when sending data to the
+   * data writer, for better performance. Data writers should handle reused data objects
+   * carefully, e.g. do not buffer them in a list. By default this returns false for
+   * safety; data sources can override it if their data writers immediately write the
+   * data object somewhere else, such as a memory buffer or disk.
+   */
+ default boolean reuseDataObject() {
--- End diff --
I don't think this should be added in this commit. This PR is to move to
`InternalRow` and should not alter the API. I'm fine documenting this, but
writers are responsible for making defensive copies if necessary. This default
is going to make sources slower, and I don't think it is necessary for
implementations other than tests that buffer data in memory.
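To illustrate the hazard being discussed: when a data source reuses one mutable row instance across calls, a writer that buffers the reference (instead of a defensive copy) ends up with every buffered entry reflecting the last write. The sketch below is self-contained and hypothetical; `MutableRow` stands in for Spark's reused `InternalRow` and is not a real Spark class.

```java
import java.util.ArrayList;
import java.util.List;

public class DefensiveCopyDemo {
    // Hypothetical stand-in for a reused row object like Spark's InternalRow.
    static final class MutableRow {
        int value;
        MutableRow copy() {
            MutableRow c = new MutableRow();
            c.value = this.value;
            return c;
        }
    }

    public static void main(String[] args) {
        MutableRow reused = new MutableRow();
        List<MutableRow> unsafeBuffer = new ArrayList<>();
        List<MutableRow> safeBuffer = new ArrayList<>();

        for (int i = 1; i <= 3; i++) {
            reused.value = i;               // the source reuses one instance
            unsafeBuffer.add(reused);       // buggy: buffers the shared reference
            safeBuffer.add(reused.copy());  // defensive copy preserves each value
        }

        // All entries in the unsafe buffer alias one object holding the last value.
        System.out.println(unsafeBuffer.get(0).value); // prints 3
        System.out.println(safeBuffer.get(0).value);   // prints 1
    }
}
```

A writer that immediately serializes each row to a memory buffer or disk never holds the reference past the call, which is why such writers can tolerate reuse without copying.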
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]