liubo1022126 opened a new pull request #3093:
URL: https://github.com/apache/iceberg/pull/3093


   Thanks @dixingxing0 @stevenzwu @rdblue
   
   As discussed in https://github.com/apache/iceberg/pull/2109 (issue 
https://github.com/apache/iceberg/issues/2108), each snapshot should record a 
timestamp showing how far the data has advanced, and the watermark history 
should be kept for trend analysis. This PR adds support for writing a 
watermark into the snapshot.
   
   The design is as follows:
   
![image](https://user-images.githubusercontent.com/47106533/132643312-add56ba8-8e67-4901-827c-424cee3aee4e.png)
   
   1. Specify a field in the table as the timestamp field with 
"write.watermark.field"; the field type must be Long.
   2. In IcebergStreamWriter, each writer subtask sends its 
max("write.watermark.field") to the next operator.
   3. In IcebergFilesCommitter, collect the N values (N = parallelism) sent 
from step 2 and take their minimum.
        a. If the current min ts is greater than the last ts, write the 
current ts to the snapshot as the watermark.
        b. If the current min ts is not greater than the last ts, keep the last 
ts as the watermark in the snapshot.
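
   The logic of steps 2 and 3 can be sketched roughly as follows. This is a 
minimal, self-contained illustration and not the code in this PR: the class 
`WatermarkSketch` and the methods `writerMax` and `nextWatermark` are 
hypothetical names, and modeling a writer subtask that saw no rows as an empty 
`OptionalLong` is an assumption.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical helper for illustration only; not part of Iceberg or this PR.
final class WatermarkSketch {
  private WatermarkSketch() {}

  // Step 2: a writer subtask reports the max value of the watermark field
  // over the rows it wrote. Empty if the subtask wrote no rows (assumption).
  static OptionalLong writerMax(List<Long> timestamps) {
    return timestamps.stream().mapToLong(Long::longValue).max();
  }

  // Step 3: the committer takes the min over all subtasks and compares it
  // with the last committed watermark.
  static long nextWatermark(long lastWatermark, List<OptionalLong> writerMaxes) {
    if (writerMaxes.isEmpty()) {
      return lastWatermark;
    }
    long min = Long.MAX_VALUE;
    for (OptionalLong writerMax : writerMaxes) {
      if (!writerMax.isPresent()) {
        // Assumption: a subtask without data holds the watermark back.
        return lastWatermark;
      }
      min = Math.min(min, writerMax.getAsLong());
    }
    // 3a: advance if the new min is greater; 3b: otherwise keep the last value.
    return Math.max(lastWatermark, min);
  }
}
```

   For example, with a last watermark of 100 and writer maxima of 150 and 120, 
the sketch returns 120. With writer maxima of 90 and 150 it returns 100, 
because the minimum (90) is not greater than the last watermark.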
   
   Users must be aware of the following limitations:
   1. If any of the parallel writers receives no data, the watermark will not 
advance.
   2. The watermark only moves forward, never backward.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


