stevenzwu commented on issue #2918:
URL: https://github.com/apache/iceberg/issues/2918#issuecomment-902187515


   @Reo-LEI great job on the issue description. Thank you for the effort.
   
   > The FlinkCDCSource parallelism will always be 1, because binlog data needs 
to be sent serially. Considering that Flink uses rebalance as the default shuffle 
strategy, we can see the CDC data will be rebalanced to three different Filter 
operators and then emitted to different IcebergStreamWriter operators.
   
   I am not familiar with the SQL or CDC parts of Flink. So the binlog is not 
shipped to Kafka first here, right? The Flink CDC source reads the binlog 
directly from MySQL?
   
   Personally, I don't see much benefit in just scaling up the parallelism of 
the IcebergStreamWriter operator. All the extra network shuffle/rebalance can be 
expensive too. I do see the need to scale up the job parallelism with higher 
traffic. Hence I am wondering if we can also scale up the Flink CDC source 
operator along with the other operators so that we can keep the chaining 
behavior, roughly as in the sketch below.
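   A minimal DataStream sketch (not from this issue; the elements and operators 
are placeholders) of what I mean: when every operator in the pipeline runs at 
the same parallelism, Flink can chain source -> filter -> writer into one task 
and no rebalance is inserted, whereas a parallelism mismatch between the source 
and the downstream operators forces a network shuffle.

   ```java
   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

   public class ChainingSketch {
       public static void main(String[] args) throws Exception {
           StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

           // Uniform parallelism for the whole job: source, filter and sink stay
           // chained in a single task, so no rebalance happens between them.
           env.setParallelism(1);

           env.fromElements("c1", "u1", "d1")        // stand-in for the CDC source
              .filter(record -> !record.isEmpty())   // stand-in for the Filter operator
              .print();                              // stand-in for IcebergStreamWriter

           // By contrast, giving only the writer a higher parallelism, e.g.
           //   .print().setParallelism(3)
           // would make Flink insert a rebalance in front of it.

           env.execute("chaining-sketch");
       }
   }
   ```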
   
   As you said already, this only works if the hash shuffle (based on the 
primary key) is the only network shuffle before the IcebergStreamWriter 
operator. That is why I have some reservations about putting this in the Flink 
Iceberg sink. It won't work if there is some other network shuffle before the 
Flink sink, which is outside the control of the sink. Hence I am wondering 
whether this should be handled outside of the Flink sink, and we should just 
document the expected behavior of the input data to the Flink Iceberg sink, 
e.g. something like the sketch below.
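   To illustrate what "handled outside of the sink" could look like, here is a 
hedged sketch (the field layout and stream contents are made up for the 
example): the job keys the change stream by its primary key before the writer, 
so all changes for the same key land on the same writer subtask and per-key 
ordering is preserved even when the writer parallelism is greater than 1.

   ```java
   import org.apache.flink.api.java.tuple.Tuple2;
   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

   public class KeyedCdcInput {
       public static void main(String[] args) throws Exception {
           StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

           // Stand-in for the CDC change stream: (primaryKey, changeType) pairs.
           env.fromElements(
                   Tuple2.of(1L, "INSERT"),
                   Tuple2.of(1L, "UPDATE"),
                   Tuple2.of(2L, "INSERT"))
              // Hash-shuffle by primary key so every change for the same key is
              // routed to the same downstream subtask.
              .keyBy(change -> change.f0)
              // Stand-in for the writer; a real job would attach the Iceberg sink here.
              .print()
              .setParallelism(3);

           env.execute("keyed-cdc-input");
       }
   }
   ```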

