kbendick commented on a change in pull request #3213:
URL: https://github.com/apache/iceberg/pull/3213#discussion_r731184059
##########
File path: flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java
##########
@@ -295,10 +303,20 @@ public Builder uidPrefix(String newPrefix) {
// Add single-parallelism committer that commits files
// after successful checkpoint or end of input
- SingleOutputStreamOperator<Void> committerStream =
appendCommitter(writerStream);
+ SingleOutputStreamOperator<EndCheckpoint> committerStream =
appendCommitter(writerStream);
+
+ // Add single-parallelism compact task generator
+ SingleOutputStreamOperator<CommonControllerMessage> compactStream =
appendCompactGenerator(committerStream);
+
+ // Add parallel rewrite task operator
+ SingleOutputStreamOperator<CommonControllerMessage> rewriteStream =
+ appendCompactOperator(compactStream.broadcast());
+
+ // Add single-parallelism compact committer operator
+ SingleOutputStreamOperator<Void> compactCommitterStream =
appendCompactCommitter(rewriteStream);
Review comment:
Commented above, but should we avoid adding the additional nodes to the
user's job graph if they've disabled compaction? They cost extra processing
time and resources that aren't needed if they're not doing anything. It would
also help users see at a glance, when they look at the job graph, that they're
not compacting files.
If users enable it, it adds operators to their job graph. If they don't
properly handle UIDs on their own operators, even though we properly handle
UIDs on the operators we add, I can't remember whether that causes issues or
not. It's likely a concern we should look into, as I know that not properly
handling UIDs across job graph changes has caused some of my users to have to
abandon state before.
But for me, it feels like we shouldn't add unused processing nodes to the
job graph if this whole workflow is disabled (e.g. users might choose to focus
their Flink compute on just writing, as Flink doesn't autoscale as well as some
might like it to, and instead compact files in a separate workflow entirely,
maybe even with Spark).
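
Roughly what I have in mind, as a minimal sketch only. It models the operator
chain as a list of names rather than real Flink streams, and the
`compactionEnabled` flag, the `buildOperators` helper, and the operator names
are all hypothetical, not anything that exists in this PR or in FlinkSink:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the sink builder; this is not Iceberg's or Flink's API.
class ConditionalCompactionSketch {

  // Returns the operator names that would end up in the job graph.
  // `compactionEnabled` and `uidPrefix` are assumed builder settings.
  static List<String> buildOperators(boolean compactionEnabled, String uidPrefix) {
    List<String> operators = new ArrayList<>();

    // The writer and the file committer are always part of the sink.
    operators.add(uidPrefix + "-writer");
    operators.add(uidPrefix + "-committer");

    // Only wire in the compaction operators when the user asked for them,
    // so a disabled feature leaves the job graph unchanged.
    if (compactionEnabled) {
      operators.add(uidPrefix + "-compact-generator");
      operators.add(uidPrefix + "-compact-operator");
      operators.add(uidPrefix + "-compact-committer");
    }
    return operators;
  }

  public static void main(String[] args) {
    System.out.println(buildOperators(false, "iceberg-sink"));
    System.out.println(buildOperators(true, "iceberg-sink"));
  }
}
```

With the flag off, the graph contains only the writer and committer, so users
who compact elsewhere see no change. Deriving every added name from the same
prefix is also only an assumption about how the UIDs could be kept stable.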
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]