rdblue commented on a change in pull request #1213:
URL: https://github.com/apache/iceberg/pull/1213#discussion_r463351988
##########
File path: core/src/main/java/org/apache/iceberg/io/UnpartitionedWriter.java
##########
@@ -17,24 +17,42 @@
 * under the License.
 */
-package org.apache.iceberg.spark.source;
+package org.apache.iceberg.io;
 import java.io.IOException;
 import org.apache.iceberg.FileFormat;
 import org.apache.iceberg.PartitionSpec;
-import org.apache.iceberg.io.FileIO;
-import org.apache.spark.sql.catalyst.InternalRow;
-class UnpartitionedWriter extends BaseWriter {
-  UnpartitionedWriter(PartitionSpec spec, FileFormat format, SparkAppenderFactory appenderFactory,
-                      OutputFileFactory fileFactory, FileIO io, long targetFileSize) {
+public class UnpartitionedWriter<T> extends BaseTaskWriter<T> {
+
+  private RollingFileWriter currentWriter = null;
+
+  public UnpartitionedWriter(PartitionSpec spec, FileFormat format, FileAppenderFactory<T> appenderFactory,
+                             OutputFileFactory fileFactory, FileIO io, long targetFileSize) {
     super(spec, format, appenderFactory, fileFactory, io, targetFileSize);
+  }
-    openCurrent();
+  @Override
+  public void write(T record) throws IOException {
+    if (currentWriter == null) {
+      currentWriter = new RollingFileWriter(null);
+    }
Review comment:
I think this PR should not change the writers to create their files lazily.

First, it changes the assumptions in the writers, which doesn't make sense to include in what is primarily a refactor.

Second, I think those assumptions were a better structure for these classes. Opening the file in the constructor and relying on it always being there avoids a null check in `write`, which is called in a tight loop. The main benefit of lazy creation is avoiding a delete in `close` when no records were written, but that check is still present in `RollingFileWriter` anyway. And I think that check _should_ be there because it is another helpful invariant: if a 0-record file is produced by any writer wrapped by `RollingFileWriter`, then it should be discarded. That guards against the problem in future implementations, which may not consider the case.
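
For illustration, here is a minimal sketch of the eager-open shape I mean. The names (`Appender`, `FileStore`, `EagerWriter`) are hypothetical stand-ins, not the real Iceberg types:

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-ins for the appender and file-IO abstractions;
// this is a sketch of the structure, not the actual Iceberg API.
interface Appender<T> extends Closeable {
  void add(T record) throws IOException;
}

interface FileStore {
  <T> Appender<T> newAppender(String path) throws IOException;
  void delete(String path);
}

class EagerWriter<T> implements Closeable {
  private final FileStore files;
  private final String path;
  private final Appender<T> appender;  // opened eagerly, never null
  private long recordCount = 0;

  EagerWriter(FileStore files, String path) throws IOException {
    this.files = files;
    this.path = path;
    this.appender = files.newAppender(path);  // open in the constructor
  }

  public void write(T record) throws IOException {
    appender.add(record);  // tight loop: no per-record null check
    recordCount += 1;
  }

  @Override
  public void close() throws IOException {
    appender.close();
    if (recordCount == 0) {
      files.delete(path);  // invariant: never surface a 0-record file
    }
  }
}
```

With this structure, the null check moves out of the per-record path entirely, and the 0-record cleanup lives in exactly one place.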
This is fairly minor, but since there are other changes needed (in
particular, the array fix for task commit messages), I'd like to change at
least the Spark writers back to eagerly creating output files instead of lazily
checking for null in `write`.