aokolnychyi edited a comment on pull request #1972:
URL: https://github.com/apache/iceberg/pull/1972#issuecomment-794707731


   Thanks for the great work, @Fokko! 
   
   We have a person who is also interested in a Beam sink for Iceberg. It would 
be great to collaborate. @RussellSpitzer and I can help with reviews. I feel 
like this is not far from being ready.
   
   A couple of questions from me (I have just a basic understanding of Beam):
   - Is it correct that this PR supports only fixed windows in Beam? 
   - I heard it is pretty common to write to multiple tables or determine which 
table to write to dynamically in Beam. Is it worth supporting such cases in the 
Iceberg sink? It does not have to be done in the first version, of course.
   - Does Beam have any way to control the distribution and ordering of data 
before it is written? For example, we have added a way to ask Spark to 
distribute and order records according to the partition spec and sort order. A 
similar effort is in progress for Flink. The main reason is to reduce the 
number of produced files and order records accordingly to benefit from min/max 
skipping and better compression. Can people shuffle and sort data manually? Can 
we do that automatically?
   
   
   The first two points are minor but the last one can be a big deal at scale. 
For example, if we don't distribute and order data on write, we will either 
produce a lot of small files or keep a lot of files open at the same time.
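
   To make the trade-off concrete, here is a minimal sketch in plain Python. 
It does not use the Beam or Iceberg APIs; the helper names and the toy records 
are hypothetical and only count files. It assumes a writer that either rolls to 
a new file whenever the partition changes, or keeps one file open per partition.

```python
# Illustrative only: models file counts, not real Beam or Iceberg writers.

def count_files_single_writer(records, partition_of):
    """One open file at a time: start a new file whenever the partition changes."""
    files = 0
    current = None
    for record in records:
        partition = partition_of(record)
        if files == 0 or partition != current:
            files += 1
            current = partition
    return files


def max_open_files_fanout(records, partition_of):
    """One open file per partition seen: few files, but all open at once."""
    return len({partition_of(record) for record in records})


def partition_of(record):
    # Hypothetical partition spec: partition by the date string.
    return record[0]


# Nine records interleaved across three daily partitions.
records = [("2021-03-0%d" % (i % 3 + 1), i) for i in range(9)]

# Unclustered input: every record switches partition, so every record opens a file.
unsorted_files = count_files_single_writer(records, partition_of)

# Clustered and ordered input: one file per partition, one open at a time.
clustered = sorted(records, key=lambda r: (r[0], r[1]))
clustered_files = count_files_single_writer(clustered, partition_of)

# Fanout on unclustered input: one file per partition, all open simultaneously.
fanout_open = max_open_files_fanout(records, partition_of)

print(unsorted_files, clustered_files, fanout_open)
```

   In this toy case the unclustered single writer produces nine files, the 
clustered one produces three, and the fanout writer also produces three but 
holds all of them open at once, which is the cost that grows with the number 
of partitions.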


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



