[GitHub] [iceberg] rdblue commented on issue #4621: [Feature Request] Iceberg integrates with Pulsar, supports java to read iceberg tables sequentially

GitBox Tue, 26 Apr 2022 16:17:17 -0700


rdblue commented on issue #4621:
URL: https://github.com/apache/iceberg/issues/4621#issuecomment-1110336966


   You should be able to reconstruct an order, but I'm not sure whether you'd 
consider it the _same_ order, or whether there even is an _original_ 
record/row order.
   
   Most systems that work with Iceberg don't have a total order because they 
process data in parallel tasks. Each task has a record order, but there's no 
order between tasks. For example, when Flink processes data from a Kafka topic, 
there is order within each Kafka partition, but no order across partitions or, 
really, across Flink tasks. Does Pulsar have a concept of total order over rows?
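
   To make the partial-order point concrete, here is a minimal sketch in 
Python. It uses no Kafka or Flink API; the `(partition, offset)` tuples and the 
helper `respects_partition_order` are hypothetical names chosen for 
illustration. It shows two different interleavings that both respect 
per-partition order, so neither can be called the one original order.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: (partition id, offset within that partition).
Record = Tuple[str, int]


def respects_partition_order(interleaving: List[Record]) -> bool:
    """Return True if offsets strictly increase within every partition."""
    last_seen: Dict[str, int] = {}
    for partition, offset in interleaving:
        if offset <= last_seen.get(partition, -1):
            return False
        last_seen[partition] = offset
    return True


# Two runs of the same parallel job may interleave partitions differently.
run_a = [("p0", 0), ("p0", 1), ("p1", 0), ("p1", 1)]
run_b = [("p1", 0), ("p0", 0), ("p1", 1), ("p0", 1)]

# This sequence reorders records inside partition p0, so it is not valid.
bad = [("p0", 1), ("p0", 0), ("p1", 0), ("p1", 1)]
```

   Both `run_a` and `run_b` pass the check even though they differ, while `bad` 
fails it.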
   
   I said above that you can reconstruct an order. That's because Iceberg keeps 
writes in order: for a given append operation, Iceberg writes the data file 
metadata into a manifest file in the original order. So all you need to do is 
read snapshots sequentially, take each snapshot's data files in the order they 
were written to the manifest, and then read records from each data file 
sequentially. We could formalize that a bit so that we can keep track of file 
order within a commit, but I'm skeptical that it is valuable, given that most 
systems rely on partial ordering and not total ordering.
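
   The three steps above can be sketched roughly as follows. This is an 
in-memory model in Python, not the Iceberg API: the `Snapshot` and `DataFile` 
classes, the `sequence` field, and the file paths are all hypothetical 
stand-ins, and the sketch assumes an append-only table where each snapshot 
lists its added files in manifest order.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class DataFile:
    # Hypothetical stand-in for a data file; records are already in file order.
    path: str
    records: List[str]


@dataclass
class Snapshot:
    # Hypothetical stand-in for an append snapshot.
    sequence: int               # assumed to reflect commit order
    added_files: List[DataFile]  # assumed to be in manifest order


def read_in_commit_order(snapshots: List[Snapshot]) -> Iterator[str]:
    """Yield records snapshot by snapshot, file by file, row by row."""
    for snapshot in sorted(snapshots, key=lambda s: s.sequence):
        for data_file in snapshot.added_files:
            for record in data_file.records:
                yield record


# Snapshots are listed out of order on purpose; the reader sorts them.
snapshots = [
    Snapshot(2, [DataFile("c.parquet", ["r5"])]),
    Snapshot(1, [DataFile("a.parquet", ["r1", "r2"]),
                 DataFile("b.parquet", ["r3", "r4"])]),
]

ordered = list(read_in_commit_order(snapshots))
```

   The result is one reconstructed order. Whether it matches what a writer 
would call the original order still depends on how the parallel tasks produced 
the files.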


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

