openinx commented on issue #360: URL: https://github.com/apache/iceberg/issues/360#issuecomment-651531457
> I have been thinking of an upsert case, where we get an entire replacement row but not the previous row values.

An `upsert` without the previous row values depends on a primary key specification; otherwise, `UPSERT(1,2)` has no way to know which row it should change. A primary key also introduces other questions:

1. Should the primary key columns include the partition column and bucket column? If not, consider a table with columns (a, b), where a is the primary key and b is the partition key: for an `UPSERT(1,2)`, the old row could be in any partition of the iceberg table, so we would need to broadcast the event to every partition, which wastes resources. If yes, it limits usage: for example, we may adjust the bucket policy based on the fields to be JOINed between two tables to avoid a massive data shuffle, and then adjusting the bucket policy would also require adjusting the primary key to include the new bucket columns.

2. How do we ensure uniqueness in an iceberg table? Within a given bucket, two different versions of the same row may appear in two different data files, so we would need a costly `JOIN` between data files (previously, we only needed to JOIN data files against delete files). Within a given iceberg table, if the primary key does not include the partition column, two rows with the same primary key may appear in two different partitions, so primary-key deduplication is a problem too.

In my opinion, CDC events from an RDBMS should always provide both the old values and the new values (as `row-level` binlog does). It would be good to handle that case correctly first. Other CDC events, such as an `UPSERT` carrying only the primary key, need more consideration.

> Using a unique ID for each CDC event could mitigate the problem, but duplicate events from at-least-once processing would still incorrectly delete data rows. How would you avoid this problem?

Spark streaming only guarantees `at-least-once` semantics.
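To make the full-values case concrete, here is a minimal sketch (plain Python with hypothetical `CdcEvent` and `partition_of` names, not Iceberg's actual API) of why carrying the old values matters: the event can be translated into a delete + insert pair locally, because the old row values tell the writer exactly which partition holds the stale copy, so no lookup or broadcast is needed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CdcEvent:
    """A row-level CDC event, e.g. decoded from a MySQL row-level binlog."""
    op: str                 # "INSERT", "UPDATE", or "DELETE"
    before: Optional[dict]  # full old row values (None for INSERT)
    after: Optional[dict]   # full new row values (None for DELETE)

def partition_of(row: dict) -> str:
    # Hypothetical partition spec: table (a, b) partitioned by column "b".
    return f"b={row['b']}"

def to_delta_ops(event: CdcEvent) -> list:
    """Translate one CDC event into (partition, action, row) triples.

    Because `before` carries the full old row, the DELETE is routed
    directly to the partition holding the stale copy; without `before`,
    the writer would have to look up or broadcast to find it.
    """
    ops = []
    if event.before is not None:
        ops.append((partition_of(event.before), "DELETE", event.before))
    if event.after is not None:
        ops.append((partition_of(event.after), "INSERT", event.after))
    return ops

# An UPDATE that moves a row across partitions (b: 2 -> 3):
update = CdcEvent(op="UPDATE", before={"a": 1, "b": 2}, after={"a": 1, "b": 3})
print(to_delta_ops(update))
# -> [('b=2', 'DELETE', {'a': 1, 'b': 2}), ('b=3', 'INSERT', {'a': 1, 'b': 3})]
```

Note how the same event without `before` (an upsert of only the new row) would leave the `b=2` partition holding a duplicate of primary key `a=1`, which is exactly the uniqueness problem described above.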
It seems hard to keep the data in the sink table consistent unless we provide an identifiable global timestamp to indicate the before/after order. Let me think more about this. Flink provides `exactly-once` semantics, so that case seems easier.

> but I'm not sure it would enable a source to replay exactly the events that were received -- after all, we're still separating inserts and deletes into separate files so we can't easily reconstruct an upsert event.

I think the mixed equality-deletion/positional-deletion approach you described makes it hard to reconstruct the correct event order. The pure positional-deletion approach described in this [document](https://docs.google.com/document/d/1bBKDD4l-pQFXaMb4nOyVK-Sl3N2NTTG37uOCQx8rKVc/edit#heading=h.ljqc7bxmc6ej) could restore the original order.

> I don't think that it makes a case for dynamic columns.

Yes, I agree that we don't need to care about dynamic columns now. Real CDC events always provide all column values for a row, so please ignore the proposal about dynamic-columns encoding.

---

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
