Reo-LEI commented on issue #4621:
URL: https://github.com/apache/iceberg/issues/4621#issuecomment-1107710089

   I previously worked with Hang (@hangc0276) to help Pulsar store cold data in 
Iceberg. As Hang mentioned above, Pulsar needs to read the data stored in 
Iceberg in the order in which it was written.
   
   However, the Iceberg writer splits data files according to record count and 
file size, and Iceberg reads data concurrently at file granularity. As a 
result, even if the upstream distributes records with the same key to the same 
writer and writes them in event order, there is no guarantee that they are read 
back in the order they were written. For example, in a snapshot, writer-1 
writes data to dataFile-1 and dataFile-2 in event order, so every event in 
dataFile-2 occurs later than those in dataFile-1. However, because the files 
are read concurrently, the records in dataFile-2 may be read before the records 
in dataFile-1.
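
   To make the scenario concrete, here is a minimal Python sketch that models 
it. It does not use any Iceberg API. The names `DataFile`, `seq`, 
`write_with_rollover`, `read_in_completion_order`, and `read_in_write_order` 
are hypothetical, and `seq` is an assumed per-writer file sequence number used 
only for illustration, not an existing Iceberg field.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    seq: int  # assumed per-writer file sequence number (hypothetical)
    records: list = field(default_factory=list)


def write_with_rollover(events, max_records_per_file):
    """Split an ordered event stream into files, rolling over by record count."""
    files = []
    for start in range(0, len(events), max_records_per_file):
        chunk = events[start:start + max_records_per_file]
        files.append(DataFile(seq=len(files), records=chunk))
    return files


def read_in_completion_order(files, completion_order):
    """Model file-level concurrent reads: files finish in an arbitrary order."""
    return [r for idx in completion_order for r in files[idx].records]


def read_in_write_order(files):
    """Restore write order by reading files sorted by their sequence number."""
    return [r for f in sorted(files, key=lambda f: f.seq) for r in f.records]


# writer-1 receives events 0..9 in event order and rolls a file every 4 records.
events = list(range(10))
files = write_with_rollover(events, max_records_per_file=4)

# The second file happens to finish before the first one.
concurrent_result = read_in_completion_order(files, completion_order=[1, 0, 2])
ordered_result = read_in_write_order(files)

print(concurrent_result)
print(ordered_result)
```

   In this model, write order is lost as soon as files complete out of order, 
and it can be recovered only if each file carries some ordering information 
that the reader respects.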
   
   This applies not only to Pulsar. Flink also needs records to be read in 
write order to implement streaming reads of the complete changelog 
(+I/-U/+U/-D). So I think it is necessary for Iceberg to support reading 
records in the order they were written.
   
   @rdblue @aokolnychyi @RussellSpitzer @kbendick @flyrain @jackye1995 
@stevenzwu @openinx 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


