hangc0276 commented on issue #4621: URL: https://github.com/apache/iceberg/issues/4621#issuecomment-1111133124
> You should be able to reconstruct an order, but I'm not sure whether you'd consider it the _same_ order or even if there is an _original_ record/row order.
>
> Most systems that work with Iceberg don't have an order because they process data in parallel tasks. Each task has a record order, but there's no order between tasks. For example, when Flink processes data from a Kafka topic, there is order within each Kafka partition, but no order across partitions or really across Flink tasks. Does Pulsar have a concept of total order over rows?
>
> I said above that you can reconstruct an order. That's because Iceberg keeps writes in order. For a given append operation, Iceberg writes the data file metadata into a manifest file in the original order. So all you need to do is read snapshots sequentially, order data files sequentially, and then read records from data files sequentially. We could formalize that a bit so that we can keep track of file order within a commit, but I'm skeptical that it is valuable given that most systems rely on partial ordering and not total ordering.

@rdblue @RussellSpitzer @flyrain Thank you for your patient reply. For Pulsar, we only need partial ordering, not total ordering. Let me explain how we integrate a Pulsar topic with Iceberg.

We write one topic's messages into one Iceberg table. A Pulsar topic has many partitions, and for each partition we create an Iceberg writer to handle message writing. Each message is fetched from one topic partition and written into the Iceberg table with additional metadata fields, such as <partitionId, ledgerId, entryId>. The pair <ledgerId, entryId>, called the `MessageId`, identifies one message. Messages from different topic partitions have different `partitionId` values, and we don't care about message order across topic partitions.

Messages from the same topic partition are written into the Iceberg table in the same order in which they are stored in the topic partition. Their <ledgerId, entryId> metadata fields are strictly increasing, for example [<1, 0>, <1, 1>, <1, 2>, <2, 0>, <2, 1>, <3, 0>, <3, 1>].

We read records from the Iceberg table by specifying a `partitionId` and a `MessageId` range, for example partitionId 0 and MessageId range [<1, 0>, <10, 20>]. The Iceberg reader needs to return the records in strictly increasing `MessageId` order. That is, the reader needs to support this per-partition order on the <ledgerId, entryId> key pair for the returned records; otherwise, it is hard for Pulsar to reorder the records returned by the Iceberg reader.

Pulsar can keep write order within a partition. The Iceberg writer writes records sequentially into multiple Parquet files. The Iceberg reader, given a partitionId and a MessageId range, can keep record order within one Parquet file, but it is hard to keep order across multiple Parquet files.
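
To make the ordering requirement concrete, here is a minimal Python sketch of a reader-side merge. It is not Iceberg or Pulsar API: `Record`, `message_id`, and `read_in_order` are hypothetical names, each "file" is modeled as an in-memory list, and the range bounds are assumed to be inclusive. It also assumes that, for a given partition, records inside each file are already in increasing `MessageId` order, which follows from the writer appending sequentially.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple, Tuple


class Record(NamedTuple):
    # Hypothetical row layout: message payload plus the metadata fields.
    partition_id: int
    ledger_id: int
    entry_id: int
    payload: bytes


def message_id(record: Record) -> Tuple[int, int]:
    # MessageId compares lexicographically: ledgerId first, then entryId.
    return (record.ledger_id, record.entry_id)


def read_in_order(
    files: Iterable[Iterable[Record]],
    partition_id: int,
    start: Tuple[int, int],
    end: Tuple[int, int],
) -> Iterator[Record]:
    """Yield records of one partition in increasing MessageId order.

    Each element of `files` stands in for one data file whose records
    for a given partition are already sorted by MessageId.
    """

    def scan(file_records: Iterable[Record]) -> Iterator[Record]:
        # Keep only the requested partition and the inclusive MessageId range.
        for record in file_records:
            if record.partition_id != partition_id:
                continue
            if start <= message_id(record) <= end:
                yield record

    # k-way merge of the per-file sorted streams; files may arrive in any order.
    yield from heapq.merge(*(scan(f) for f in files), key=message_id)
```

With this shape, the order of the files themselves does not matter: only the order inside each file is relied upon, and the merge restores the per-partition `MessageId` order across files.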
