geonyeongkim commented on issue #7568:
URL: https://github.com/apache/iceberg/issues/7568#issuecomment-1545038880

   @BsoBird 
   Oh, that's right.
   
   I confirmed that Hudi's Flink integration provides a compaction operator.
   
   I wanted to know whether Iceberg's Flink integration offers the same, but it doesn't seem to be supported.
   
   If there are multiple writers to the same Iceberg table, an error occurs.
   
   @BsoBird, are you pausing the stream processing and running the rewrite as a batch job?
   
   Iceberg's Flink writer produces very small files (a few KB each).
   So I'm not sure whether an architecture that stops the stream every few minutes so a rewrite can run is the right approach.
   
   Spark Structured Streaming can largely avoid small files when loading into Iceberg because it operates in micro-batches. However, to prevent out-of-order CDC data from overwriting newer state, each micro-batch must be deduplicated in the Spark engine with a window function, keeping only the latest record per key.
   
   This causes significant throughput degradation due to the shuffle.
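   To make the deduplication step concrete, here is a minimal plain-Python sketch of the "keep only the latest record per key" logic that the Spark window function performs on each micro-batch. The field names (`id`, `ts`, `op`) are hypothetical, not from any actual schema:

   ```python
   def dedupe_latest(records):
       """Keep only the most recent record per key, so a late-arriving
       CDC event cannot overwrite newer state (no data reversal)."""
       latest = {}
       for rec in records:
           key = rec["id"]
           # Replace the stored record only if this one is strictly newer.
           if key not in latest or rec["ts"] > latest[key]["ts"]:
               latest[key] = rec
       return list(latest.values())

   batch = [
       {"id": 1, "ts": 10, "op": "INSERT"},
       {"id": 1, "ts": 12, "op": "UPDATE"},  # newer change for the same key
       {"id": 2, "ts": 11, "op": "INSERT"},
       {"id": 1, "ts": 11, "op": "UPDATE"},  # late event; must not win
   ]
   deduped = dedupe_latest(batch)
   ```

   In Spark the same effect is typically achieved with a shuffle (partition by key, order by timestamp, keep row number 1), which is where the throughput cost comes from.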
   
   Ultimately, I want to load the CDC data as quickly as possible without producing small files.
   
   @BsoBird, may I learn more about your architecture using Iceberg with Flink?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
