Reo-LEI commented on pull request #2898:
URL: https://github.com/apache/iceberg/pull/2898#issuecomment-891584337


   > > Actually, setting `write.distribution-mode = hash` cannot resolve this, 
because CDC data should be distributed by primary key, not by partition fields. 
For example, an Iceberg table has equalityFields (dt, hour, id) and partition 
fields (dt, hour). If we set `write.distribution-mode = hash`, the CDC data will 
be keyed by (dt, hour), and we cannot ensure that CDC data with the same id is 
sent to the same `IcebergStreamWriter`.
   > 
   > I'm a little confused. Assuming the upstream data is ordered, if the 
CDC data is keyed by (dt, hour), records with the same id should be sent to the 
same `IcebergStreamWriter`.
   > 
   > `dt=2021-08-03, hour=10, id=1` and `dt=2021-08-03, hour=10, id=2` should 
go to the same `IcebergStreamWriter`. I think the partition is just a subset of 
the primary key. If anything is wrong, please let me know. Thanks!
   
   I think my example does not fit the CDC case where the upstream of 
`IcebergStreamWriter` is ordered. Because CDC data contains -U records, their 
partition values route each record to the correct writer, which can then delete 
the old record. 
   
   I think setting `write.distribution-mode = hash` works only when the Iceberg 
table has partition fields and the input stream is a CDC stream. Once the input 
stream is an upsert stream, or the Iceberg table is unpartitioned and has no 
partition fields, setting `write.distribution-mode = hash` will not work.
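   
   To make the routing argument concrete, here is a minimal sketch in plain 
Python, not Iceberg or Flink code. `writer_for` and `round_robin` are 
hypothetical helpers, and the sketch assumes that an unpartitioned table gives 
hash distribution nothing to key on, so records are spread without regard to 
their key (modeled here as round-robin).

```python
import zlib

PARALLELISM = 4  # assumed number of writer subtasks


def writer_for(record, key_fields, parallelism):
    """Pick a writer subtask by hashing the chosen key fields (hypothetical helper)."""
    key = "|".join(str(record[f]) for f in key_fields)
    return zlib.crc32(key.encode("utf-8")) % parallelism


def round_robin(index, parallelism):
    """Assumed stand-in for a stream that is not shuffled by any key."""
    return index % parallelism


# A CDC update: the -U and +U records carry the same partition values and id.
update_before = {"op": "-U", "dt": "2021-08-03", "hour": 10, "id": 1, "v": "a"}
update_after = {"op": "+U", "dt": "2021-08-03", "hour": 10, "id": 1, "v": "b"}

partition_fields = ["dt", "hour"]
equality_fields = ["dt", "hour", "id"]

# Partitioned table, keyed by partition fields: both records reach one writer.
same_writer_by_partition = (
    writer_for(update_before, partition_fields, PARALLELISM)
    == writer_for(update_after, partition_fields, PARALLELISM)
)

# Unpartitioned table: with no key to hash on, two consecutive records for
# the same id can land on different writers.
unpartitioned_writers = [round_robin(i, PARALLELISM) for i in range(2)]

# Keyed by equality fields: the same key always maps to the same writer,
# whether or not the table is partitioned.
same_writer_by_equality = (
    writer_for(update_before, equality_fields, PARALLELISM)
    == writer_for(update_after, equality_fields, PARALLELISM)
)
```

   In this model, keying by the equality fields co-locates records with the same 
key in both cases, while keying by partition fields only helps when the table 
has partition fields.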
   
   The reason I don't agree with using `write.distribution-mode = hash` to avoid 
this problem is that `write.distribution-mode` is meant to reduce the number of 
small files, not to resolve row-level deletes. We should not bind 
distribution-mode to row-level deletes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


