coolderli commented on pull request #2898: URL: https://github.com/apache/iceberg/pull/2898#issuecomment-891486040
> Actually, setting `write.distribution-mode = hash` cannot resolve this, because CDC data should be distributed by primary key, not by partition fields. For example, an Iceberg table has equalityFields (dt, hour, id) and partition fields (dt, hour). If `write.distribution-mode = hash` is set, the CDC data will be keyed by (dt, hour), and we cannot ensure that CDC data with the same id is sent to the same `IcebergStreamWriter`.

I'm a little confused. Assume that the upstream data is ordered. If the CDC data is keyed by (dt, hour), records with the same id should be sent to the same `IcebergStreamWriter`. For example, `dt=2021-08-03, hour=10, id=1` and `dt=2021-08-03, hour=10, id=2` should both go to the same `IcebergStreamWriter`. I think the partition fields are just a subset of the primary key, so records with the same primary key always share the same (dt, hour). If anything is wrong, please let me know. Thanks!
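The argument above can be checked with a minimal sketch. It assumes a simplified hash-and-modulo routing rule as a stand-in for how a keyed stream assigns records to writer subtasks; it is not Flink's actual key-group assignment, and `KeyBySketch`, `writerByPartition`, and `writerByPrimaryKey` are hypothetical names used only for illustration.

```java
import java.util.Objects;

public class KeyBySketch {

    // Hypothetical routing rule (an assumption, not Flink's real algorithm):
    // hash the key fields and map the hash onto one of `parallelism` writers.
    static int route(int parallelism, Object... keyFields) {
        return Math.floorMod(Objects.hash(keyFields), parallelism);
    }

    // Keying by the partition fields (dt, hour): the id is ignored on purpose.
    static int writerByPartition(String dt, int hour, long id, int parallelism) {
        return route(parallelism, dt, hour);
    }

    // Keying by the full primary key (dt, hour, id).
    static int writerByPrimaryKey(String dt, int hour, long id, int parallelism) {
        return route(parallelism, dt, hour, id);
    }

    public static void main(String[] args) {
        int parallelism = 4;

        // Two CDC events for the same primary key (e.g. an insert followed by a delete).
        int first = writerByPartition("2021-08-03", 10, 1L, parallelism);
        int second = writerByPartition("2021-08-03", 10, 1L, parallelism);
        System.out.println("same primary key, same writer: " + (first == second));

        // A different id in the same partition also lands on that writer,
        // because the routing key does not include the id.
        int otherId = writerByPartition("2021-08-03", 10, 2L, parallelism);
        System.out.println("same partition, same writer: " + (first == otherId));
    }
}
```

Under this assumption, two records with an identical (dt, hour, id) always have an identical (dt, hour), so keying by the partition fields cannot split them across writers. The cost is that all ids in one partition share a single writer, whereas keying by the full primary key could spread them out.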
