Reo-LEI commented on issue #4621: URL: https://github.com/apache/iceberg/issues/4621#issuecomment-1107710089
I worked with Hang (@hangc0276) before to help Pulsar store cold data using Iceberg. As Hang mentioned above, Pulsar needs to read the data stored in Iceberg in the order in which it was written. However, the Iceberg writer rolls over to a new data file based on record count and file size, and Iceberg reads data concurrently at file granularity. As a result, even if the upstream routes all records with the same key to the same writer and writes them in event order, the records are not guaranteed to be read back in that order. For example, within a single snapshot, writer-1 writes events to dataFile-1 and then to dataFile-2, so every event in dataFile-2 occurs later than those in dataFile-1. Because the files are read concurrently, records in dataFile-2 may be read before records in dataFile-1.

This is not specific to Pulsar. Flink also needs records to be read in write order to implement streaming reads of a complete changelog (+I/-U/+U/-D). So I think it is necessary for Iceberg to support reading records in the order they were written.

@rdblue @aokolnychyi @RussellSpitzer @kbendick @flyrain @jackye1995 @stevenzwu @openinx
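
To make the dataFile-1/dataFile-2 example concrete, here is a minimal Python sketch of one way a reader could restore write order. It is only an illustration under assumptions: `DataFile`, `writer_id`, `file_seq`, and `read_in_write_order` are hypothetical names, not Iceberg APIs, and the sketch assumes each file carries a per-writer roll-over counter that the reader can sort on.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataFile:
    writer_id: int      # hypothetical: which writer produced the file
    file_seq: int       # hypothetical: per-writer roll-over counter
    records: List[str]  # records in the order the writer appended them


def read_in_write_order(files: List[DataFile]) -> Dict[int, List[str]]:
    """Group files by writer, then read each group in file_seq order.

    Files from different writers are independent and could still be read
    in parallel; only files from the same writer are read sequentially.
    """
    by_writer: Dict[int, List[DataFile]] = defaultdict(list)
    for f in files:
        by_writer[f.writer_id].append(f)

    result: Dict[int, List[str]] = {}
    for writer_id, group in by_writer.items():
        ordered = sorted(group, key=lambda f: f.file_seq)
        result[writer_id] = [r for f in ordered for r in f.records]
    return result


# The scan may hand files to the reader in any order, e.g. dataFile-2 first.
files = [
    DataFile(writer_id=1, file_seq=2, records=["e3", "e4"]),  # dataFile-2
    DataFile(writer_id=1, file_seq=1, records=["e1", "e2"]),  # dataFile-1
]
ordered = read_in_write_order(files)
```

The trade-off in this sketch is that read parallelism drops from per-file to per-writer, which is the price of keeping each writer's records in event order.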
