[GitHub] [iceberg] stevenzwu commented on a change in pull request #2886: Flink: if provided, add uidPrefix to operator name so that Flink web …

GitBox Thu, 29 Jul 2021 10:52:36 -0700


stevenzwu commented on a change in pull request #2886:
URL: https://github.com/apache/iceberg/pull/2886#discussion_r679365088




##########
File path: flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java
##########
@@ -246,52 +248,71 @@ public Builder uidPrefix(String newPrefix) {
         }
       }
 
-      // Find out the equality field id list based on the user-provided 
equality field column names.
-      List<Integer> equalityFieldIds = Lists.newArrayList();
-      if (equalityFieldColumns != null && equalityFieldColumns.size() > 0) {
-        for (String column : equalityFieldColumns) {
-          org.apache.iceberg.types.Types.NestedField field = 
table.schema().findField(column);
-          Preconditions.checkNotNull(field, "Missing required equality field 
column '%s' in table schema %s",
-              column, table.schema());
-          equalityFieldIds.add(field.fieldId());
-        }
-      }
-
       // Convert the requested flink table schema to flink row type.
       RowType flinkRowType = toFlinkRowType(table.schema(), tableSchema);
 
       // Distribute the records from input data stream based on the 
write.distribution-mode.
       rowDataInput = distributeDataStream(rowDataInput, table.properties(), 
table.spec(), table.schema(), flinkRowType);
 
-      // Chain the iceberg stream writer and committer operator.
-      IcebergStreamWriter<RowData> streamWriter = createStreamWriter(table, 
flinkRowType, equalityFieldIds);
-      IcebergFilesCommitter filesCommitter = new 
IcebergFilesCommitter(tableLoader, overwrite);
+      // Add parallel writers that append rows to files
+      SingleOutputStreamOperator<WriteResult> writerStream = 
appendWriter(rowDataInput, flinkRowType);
 
-      this.writeParallelism = writeParallelism == null ? 
rowDataInput.getParallelism() : writeParallelism;
+      // Add single-parallelism committer that commits files
+      // after successful checkpoint or end of input
+      SingleOutputStreamOperator<Void> committerStream = 
appendCommitter(writerStream);
 
-      SingleOutputStreamOperator<WriteResult> writerStream = rowDataInput
-          .transform(ICEBERG_STREAM_WRITER_NAME, 
TypeInformation.of(WriteResult.class), streamWriter)
-          .setParallelism(writeParallelism);
+      // Add dummy discard sink
+      return appendDummySink(committerStream);
+    }
+
+    private String getOperatorName(String suffix) {
+      return uidPrefix != null ? uidPrefix + "-" + suffix : suffix;
+    }
+
+    private DataStreamSink<RowData> 
appendDummySink(SingleOutputStreamOperator<Void> committerStream) {
+      DataStreamSink<RowData> resultStream = committerStream
+          .addSink(new DiscardingSink())
+          .name(getOperatorName(String.format("IcebergSink %s", 
this.table.name())))
+          .setParallelism(1);
       if (uidPrefix != null) {
-        writerStream = writerStream.uid(uidPrefix + "-writer");
+        resultStream = resultStream.uid(uidPrefix + "-dummysink");
       }
+      return resultStream;
+    }
 
+    private SingleOutputStreamOperator<Void> 
appendCommitter(SingleOutputStreamOperator<WriteResult> writerStream) {
+      final IcebergFilesCommitter filesCommitter = new 
IcebergFilesCommitter(tableLoader, overwrite);
       SingleOutputStreamOperator<Void> committerStream = writerStream
-          .transform(ICEBERG_FILES_COMMITTER_NAME, Types.VOID, filesCommitter)
+          .transform(getOperatorName(ICEBERG_FILES_COMMITTER_NAME), 
Types.VOID, filesCommitter)
           .setParallelism(1)
           .setMaxParallelism(1);
       if (uidPrefix != null) {
         committerStream = committerStream.uid(uidPrefix + "-committer");
       }
+      return committerStream;
+    }
 
-      DataStreamSink<RowData> resultStream = committerStream
-          .addSink(new DiscardingSink())
-          .name(String.format("IcebergSink %s", table.name()))
-          .setParallelism(1);
+    private SingleOutputStreamOperator<WriteResult> 
appendWriter(DataStream<RowData> input, RowType flinkRowType) {

Review comment:
       we are still reference to the stream after `distributeDataStream`. still 
using the old style of mutable `rowDataInput`. Hence unit tests run fine. 
   
   I definitely see that it is easier to get confused. let me update the code 
to make it more consistent and clear




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] stevenzwu commented on a change in pull request #2886: Flink: if provided, add uidPrefix to operator name so that Flink web …

Reply via email to