stevenzwu edited a comment on issue #2208:
URL: https://github.com/apache/iceberg/issues/2208#issuecomment-774178386


   Yeah, a single Kafka producer/sink supports writing to multiple Kafka topics 
as long as they are all on the same Kafka cluster, so Kafka handles this 
situation comfortably. It is not without penalty, though: spreading writes 
across many topics reduces the effectiveness of record batching and increases 
disk I/O on the broker side. 
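   As a point of reference, here is a minimal sketch with the plain Kafka producer API (the topic names and broker address are made up): a single producer instance can send to any number of topics on the same cluster, but each extra topic dilutes per-topic batching.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MultiTopicProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker-1:9092");                // one cluster
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    // One producer instance; records are routed to different topics on the
    // same cluster. This works, but it spreads records over more batches.
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.send(new ProducerRecord<>("orders", "o-1", "order payload"));
      producer.send(new ProducerRecord<>("clicks", "c-1", "click payload"));
      producer.send(new ProducerRecord<>("payments", "p-1", "payment payload"));
    }
  }
}
```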
   
   It is very expensive (and maybe impractical) for a single Iceberg sink to 
support a large and growing number of tables. The writers would need to keep 
many files open, which could lead to memory pressure in the writer tasks. When 
it is time to checkpoint and commit, the writers would have to flush and upload 
files for hundreds of tables, and the committer would have to commit to 
hundreds of tables. That would be very slow. I would suggest doing the demux 
upstream, before the Iceberg sink jobs.
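   A minimal sketch of that demux step, assuming the dataset name travels in the record key and that the topic names (`all-datasets`, `iceberg-<dataset>`) are placeholders: a small consumer/producer job fans the combined topic out into per-dataset topics, and each per-dataset topic then feeds its own Iceberg sink job, which only has to manage open files and commits for one table.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class DatasetDemux {
  public static void main(String[] args) {
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "broker-1:9092");
    consumerProps.put("group.id", "dataset-demux");
    consumerProps.put("key.deserializer", StringDeserializer.class.getName());
    consumerProps.put("value.deserializer", StringDeserializer.class.getName());

    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "broker-1:9092");
    producerProps.put("key.serializer", StringSerializer.class.getName());
    producerProps.put("value.serializer", StringSerializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
         KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
      consumer.subscribe(List.of("all-datasets"));
      while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
          // Assumption for this sketch: the dataset name is carried in the record key.
          String dataset = record.key();
          producer.send(new ProducerRecord<>("iceberg-" + dataset, record.key(), record.value()));
        }
      }
    }
  }
}
```

   A production version would of course need proper offset management and delivery guarantees (e.g. Kafka transactions); this only shows the routing.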
   
   Also, if you have a single Kafka topic holding a growing number of different 
datasets, you lose the benefit of schema validation when ingesting data into 
Kafka. Having a separate Kafka topic and schema validation for each dataset 
may also help with data quality.
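   For example (Confluent Schema Registry shown here as one common option, not part of Apache Kafka itself; the broker and registry addresses are placeholders), keeping one subject per dataset topic means a producer with auto-registration disabled fails for records whose schema has not been registered for that topic, and the registry enforces its compatibility rules whenever a schema is registered.

```java
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ValidatedProducerConfig {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker-1:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // Avro serializer backed by a schema registry (requires the
    // kafka-avro-serializer dependency); serialization fails if the record's
    // schema is not registered under the target topic's subject.
    props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("schema.registry.url", "http://schema-registry:8081");
    props.put("auto.register.schemas", "false"); // only pre-registered, reviewed schemas allowed

    KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props);
    producer.close();
  }
}
```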

