rdblue commented on a change in pull request #1213:
URL: https://github.com/apache/iceberg/pull/1213#discussion_r463351988
##########
File path: core/src/main/java/org/apache/iceberg/io/UnpartitionedWriter.java
##########
@@ -17,24 +17,42 @@
 * under the License.
 */
-package org.apache.iceberg.spark.source;
+package org.apache.iceberg.io;
 import java.io.IOException;
 import org.apache.iceberg.FileFormat;
 import org.apache.iceberg.PartitionSpec;
-import org.apache.iceberg.io.FileIO;
-import org.apache.spark.sql.catalyst.InternalRow;
-class UnpartitionedWriter extends BaseWriter {
-  UnpartitionedWriter(PartitionSpec spec, FileFormat format, SparkAppenderFactory appenderFactory,
-                      OutputFileFactory fileFactory, FileIO io, long targetFileSize) {
+public class UnpartitionedWriter<T> extends BaseTaskWriter<T> {
+
+  private RollingFileWriter currentWriter = null;
+
+  public UnpartitionedWriter(PartitionSpec spec, FileFormat format, FileAppenderFactory<T> appenderFactory,
+                             OutputFileFactory fileFactory, FileIO io, long targetFileSize) {
     super(spec, format, appenderFactory, fileFactory, io, targetFileSize);
+  }
-    openCurrent();
+  @Override
+  public void write(T record) throws IOException {
+    if (currentWriter == null) {
+      currentWriter = new RollingFileWriter(null);
+    }
Review comment:
I think this PR should not change the writers to create their files lazily.

First, it changes the assumptions in the writers, which doesn't make sense to include in what is primarily a refactor.

Second, I think those assumptions were a better structure for these classes. Opening the file in the constructor and relying on it always being there avoids a null check in `write`, which is called in a tight loop. The main benefit of lazy creation is avoiding a delete in `close` when no records were written, but that check is still present in `RollingFileWriter` anyway. And I think that check _should_ be there because it is another helpful invariant: if a 0-record file is produced by any writer wrapped by `RollingFileWriter`, then it should be discarded. That guards against the problem in future implementations, which may not consider the case.
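
For illustration, here is a minimal sketch of the eager-open shape I mean. The names (`Appender`, `FileStore`, `EagerWriter`) are hypothetical stand-ins, not the real Iceberg types:

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-ins for the appender and file-IO abstractions;
// this is a sketch of the structure, not the actual Iceberg API.
interface Appender<T> extends Closeable {
  void add(T record) throws IOException;
}

interface FileStore {
  <T> Appender<T> newAppender(String path) throws IOException;
  void delete(String path);
}

class EagerWriter<T> implements Closeable {
  private final FileStore files;
  private final String path;
  private final Appender<T> appender;  // opened eagerly, never null
  private long recordCount = 0;

  EagerWriter(FileStore files, String path) throws IOException {
    this.files = files;
    this.path = path;
    this.appender = files.newAppender(path);  // open in the constructor
  }

  public void write(T record) throws IOException {
    appender.add(record);  // tight loop: no per-record null check
    recordCount += 1;
  }

  @Override
  public void close() throws IOException {
    appender.close();
    if (recordCount == 0) {
      files.delete(path);  // invariant: never surface a 0-record file
    }
  }
}
```

With this structure, the null check moves out of the per-record path entirely, and the 0-record cleanup lives in exactly one place.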
This is fairly minor, but since there are other changes needed (in
particular, the array fix for task commit messages), I'd like to change at
least the Spark writers back to eagerly creating output files instead of lazily
checking for null in `write`.