rdblue commented on a change in pull request #4047:
URL: https://github.com/apache/iceberg/pull/4047#discussion_r805432750
##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/TestSparkDistributionAndOrderingUtil.java
##########
@@ -1567,6 +1567,370 @@ public void testRangePositionDeltaUpdatePartitionedTable() {
         table, UPDATE, expectedDistribution, SPEC_ID_PARTITION_FILE_POSITION_ORDERING);
   }
+  // ==================================================================================
+  // Distribution and ordering for merge-on-read MERGE operations with position deletes
+  // ==================================================================================
+  //
+  // UNPARTITIONED UNORDERED
+  // -------------------------------------------------------------------------
+  // merge mode is NOT SET -> rely on write distribution and ordering as a basis
+  // merge mode is NONE -> unspecified distribution + LOCALLY ORDER BY _spec_id, _partition, _file, _pos
+  // merge mode is HASH -> unspecified distribution + LOCALLY ORDER BY _spec_id, _partition, _file, _pos
+  // merge mode is RANGE -> unspecified distribution + LOCALLY ORDER BY _spec_id, _partition, _file, _pos
+  //
+  // UNPARTITIONED ORDERED BY id, data
+  // -------------------------------------------------------------------------
+  // merge mode is NOT SET -> rely on write distribution and ordering as a basis
+  // merge mode is NONE -> unspecified distribution +
+  //                       LOCALLY ORDER BY _spec_id, _partition, _file, _pos, id, data
+  // merge mode is HASH -> unspecified distribution +
Review comment:
Oh, I think I see. I was thinking about the `PARTITIONED BY, UNORDERED` case that is actually below. I reached the same conclusion you did for that case, so that's good validation!

Here, it still seems bad to me not to distribute. That's going to result in a lot of small delete files, which is really expensive and possibly worse than having a single writer for all the inserted data. It would be nice to be able to round-robin the new data... what about using something like `HASH DISTRIBUTE BY _spec, _partition, bucket(id, data, numShufflePartitions)`?
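To make the suggestion concrete, here is a toy sketch (plain Java, not the actual Spark/Iceberg API; `Row`, `bucket`, and `clusterKey` are hypothetical names) of why clustering by `(_spec_id, _partition, bucket(...))` bounds the number of output files: every row with the same clustering key lands on the same writer, so file count is capped at partitions times buckets instead of partitions times shuffle tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Random;
import java.util.TreeMap;

public class ClusteredWriteSketch {

  // Simplified stand-in for a row with its metadata and data columns.
  record Row(int specId, String partition, long id, String data) {}

  // bucket(id, data, numBuckets): a stable hash of the data columns
  // folded into the target number of buckets.
  static int bucket(Row row, int numBuckets) {
    return Math.floorMod(Objects.hash(row.id(), row.data()), numBuckets);
  }

  // Clustering key: rows sharing a key are routed to the same writer task,
  // so each key produces at most one file.
  static String clusterKey(Row row, int numBuckets) {
    return row.specId() + "/" + row.partition() + "/" + bucket(row, numBuckets);
  }

  public static void main(String[] args) {
    int numShufflePartitions = 4;
    String[] partitions = {"p0", "p1", "p2"};

    // Generate some synthetic rows spread over three table partitions.
    List<Row> rows = new ArrayList<>();
    Random rnd = new Random(42);
    for (long i = 0; i < 1000; i++) {
      rows.add(new Row(0, partitions[rnd.nextInt(partitions.length)], i, "d" + rnd.nextInt(100)));
    }

    // Group rows by clustering key: each group is one writer's input.
    Map<String, Integer> groups = new TreeMap<>();
    for (Row row : rows) {
      groups.merge(clusterKey(row, numShufflePartitions), 1, Integer::sum);
    }

    // The group (and hence file) count is bounded by
    // |partitions| * numShufflePartitions, no matter how many shuffle
    // tasks the job runs with.
    System.out.println("writer groups = " + groups.size());
    System.out.println("bound = " + partitions.length * numShufflePartitions);
  }
}
```

With an unspecified distribution, by contrast, every one of N shuffle tasks can receive rows for every partition, so the worst case is N files per partition rather than the fixed bucket count above.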
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]