liubo1022126 opened a new pull request #3093: URL: https://github.com/apache/iceberg/pull/3093
Thanks @dixingxing0 @stevenzwu @rdblue.

As discussed in https://github.com/apache/iceberg/pull/2109 (issue https://github.com/apache/iceberg/issues/2108), we need the snapshot to record the timestamp up to which the data has advanced, and we also need the detailed watermark history for trend analysis. This PR therefore adds support for writing a watermark into the snapshot.

The design is as follows:

1. Specify a field in the table as the timestamp field via "write.watermark.field". The field type is Long.
2. In IcebergStreamWriter, send max("write.watermark.field") to the next operator.
3. In IcebergFilesCommitter, collect the N values (N = writer parallelism) sent from step 2 and take their minimum.
   a. If the current minimum timestamp is greater than the last timestamp, write the current timestamp to the snapshot as the watermark.
   b. If the current minimum timestamp is not greater than the last timestamp, keep the last timestamp in the snapshot as the watermark.

Users must be aware of the following behavior:

1. If some of the parallel writers receive no data, the watermark will not advance.
2. The watermark only moves forward, never backward.
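The committer-side rule in step 3 can be illustrated with a minimal, standalone sketch. This is not the code from the PR: the class `WatermarkSketch`, the methods `writerMax` and `nextWatermark`, and the use of an empty `OptionalLong` to represent a writer that saw no data are all assumptions made for illustration, and no Flink or Iceberg APIs are used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical illustration of the design above; names are not from the PR.
final class WatermarkSketch {

    // Step 2 (sketch): a writer subtask reports the max of the watermark field
    // over the records it wrote, or empty if it wrote nothing.
    static OptionalLong writerMax(long[] timestamps) {
        return Arrays.stream(timestamps).max();
    }

    // Step 3 (sketch): take the min over all writers' reported values and
    // only replace the last watermark when that min is strictly greater.
    static long nextWatermark(long lastWatermark, List<OptionalLong> writerMaxes) {
        if (writerMaxes.isEmpty()) {
            return lastWatermark;
        }
        long min = Long.MAX_VALUE;
        for (OptionalLong reported : writerMaxes) {
            if (!reported.isPresent()) {
                // Assumption: a writer with no data blocks advancement.
                return lastWatermark;
            }
            min = Math.min(min, reported.getAsLong());
        }
        // 3a: advance when the new min is greater; 3b: otherwise keep the last value.
        return min > lastWatermark ? min : lastWatermark;
    }

    public static void main(String[] args) {
        List<OptionalLong> advancing = Arrays.asList(
                OptionalLong.of(120L), OptionalLong.of(110L), OptionalLong.of(130L));
        List<OptionalLong> lagging = Arrays.asList(
                OptionalLong.of(120L), OptionalLong.of(90L));

        System.out.println(nextWatermark(100L, advancing)); // prints 110
        System.out.println(nextWatermark(100L, lagging));   // prints 100
    }
}
```

In this sketch the watermark is monotonically non-decreasing by construction, and a single writer without data holds the watermark at its previous value, which matches the two caveats listed above.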
