[GitHub] [iceberg] kbendick commented on a change in pull request #3213: Flink: auto compact small files

GitBox Mon, 18 Oct 2021 11:02:28 -0700


kbendick commented on a change in pull request #3213:
URL: https://github.com/apache/iceberg/pull/3213#discussion_r731184059




##########
File path: flink/src/main/java/org/apache/iceberg/flink/sink/FlinkSink.java
##########
@@ -295,10 +303,20 @@ public Builder uidPrefix(String newPrefix) {
 
       // Add single-parallelism committer that commits files
       // after successful checkpoint or end of input
-      SingleOutputStreamOperator<Void> committerStream = appendCommitter(writerStream);
+      SingleOutputStreamOperator<EndCheckpoint> committerStream = appendCommitter(writerStream);
+
+      //  Add single-parallelism compact task generator
+      SingleOutputStreamOperator<CommonControllerMessage> compactStream = appendCompactGenerator(committerStream);
+
+      //  Add parallel rewrite task operator
+      SingleOutputStreamOperator<CommonControllerMessage> rewriteStream =
+          appendCompactOperator(compactStream.broadcast());
+
+      //  Add single-parallelism compact committer operator
+      SingleOutputStreamOperator<Void> compactCommitterStream = appendCompactCommitter(rewriteStream);

Review comment:
       Commented above, but should we avoid adding the additional nodes to the 
user's job graph if they've disabled compaction? They cost extra processing 
time and resources that aren't needed when the operators do nothing. Leaving 
them out would also let you (as a user) see at a glance from the job graph 
that files are not being compacted.
   
   If users enable it, it adds operators to their job graph. If they don't 
properly handle UIDs on their own operators, even though we properly handle 
UIDs on the operators we add, I can't remember whether that causes issues. 
It's likely a concern we should look into, as I know that not properly 
handling UIDs across job graph changes has forced some of my users to abandon 
state before.
   
   But for me, it feels like we shouldn't add unused processing nodes to the 
job graph if this whole workflow is disabled (e.g. users might choose to focus 
their Flink compute on just writing, since Flink doesn't autoscale as well as 
some might like, and instead compact files in a separate workflow entirely, 
maybe even with Spark).
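
   To make the suggestion concrete, here is a rough sketch of the gating I have 
in mind. It is a plain-Java model, not the Flink API and not the code in this 
PR: `SinkTopologySketch`, `Node`, `buildTopology`, the `compactionEnabled` flag, 
the node names, and the `<uidPrefix>-<name>` UID scheme are all hypothetical. 
It also assumes the rewrite operator would run at the writer parallelism.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the sink topology. All names here are hypothetical;
// nothing in this sketch calls the Flink or Iceberg APIs.
public class SinkTopologySketch {

  // Stand-in for one operator node in the job graph.
  public static final class Node {
    public final String name;
    public final String uid; // null when the user did not set a UID prefix
    public final int parallelism;

    Node(String name, String uid, int parallelism) {
      this.name = name;
      this.uid = uid;
      this.parallelism = parallelism;
    }

    @Override
    public String toString() {
      return name + " (uid=" + uid + ", parallelism=" + parallelism + ")";
    }
  }

  // Builds the list of nodes the sink would add to the job graph.
  // The three compaction nodes are appended only when compaction is enabled.
  public static List<Node> buildTopology(
      boolean compactionEnabled, String uidPrefix, int writerParallelism) {
    List<Node> nodes = new ArrayList<>();
    nodes.add(node("writer", uidPrefix, writerParallelism));
    nodes.add(node("committer", uidPrefix, 1));

    if (compactionEnabled) {
      nodes.add(node("compact-generator", uidPrefix, 1));
      // Assumption: the parallel rewrite operator uses the writer parallelism.
      nodes.add(node("compact-operator", uidPrefix, writerParallelism));
      nodes.add(node("compact-committer", uidPrefix, 1));
    }
    return nodes;
  }

  // Derives a deterministic UID from the prefix so that a node keeps the
  // same UID whether or not the compaction nodes are present.
  private static Node node(String name, String uidPrefix, int parallelism) {
    String uid = uidPrefix == null ? null : uidPrefix + "-" + name;
    return new Node(name, uid, parallelism);
  }

  public static void main(String[] args) {
    System.out.println("compaction disabled:");
    for (Node n : buildTopology(false, "iceberg-sink", 4)) {
      System.out.println("  " + n);
    }
    System.out.println("compaction enabled:");
    for (Node n : buildTopology(true, "iceberg-sink", 4)) {
      System.out.println("  " + n);
    }
  }
}
```

   With the flag off, only the writer and committer nodes are produced, and 
their UIDs are the same as with the flag on. That is the property I'd want the 
real operators to have, so that toggling compaction doesn't change the 
identity of the existing stateful operators.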




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



