rdblue commented on a change in pull request #3661:
URL: https://github.com/apache/iceberg/pull/3661#discussion_r770724140
##########
File path:
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java
##########
@@ -178,8 +178,25 @@ public DistributionMode distributionMode() {
}
}
- public DistributionMode deleteDistributionMode() {
- return rowLevelCommandDistributionMode(TableProperties.DELETE_DISTRIBUTION_MODE);
+ public DistributionMode copyOnWriteDeleteDistributionMode() {
+ String deleteModeName = confParser.stringConf()
+ .option(SparkWriteOptions.DISTRIBUTION_MODE)
+ .tableProperty(TableProperties.DELETE_DISTRIBUTION_MODE)
+ .parseOptional();
+
+ if (deleteModeName != null) {
+ DistributionMode deleteMode = DistributionMode.fromName(deleteModeName);
+ if (deleteMode == RANGE && table.spec().isUnpartitioned() && table.sortOrder().isUnsorted()) {
Review comment:
If the delete mode was explicitly set to RANGE, I think we should pass it
through. I see the logic of not doing that when there is no custom sort order,
but in that case I think we should sort by `_file` and `_pos` so that rows stay
in the same files and in their original order.

If we're sorting by `_file` and `_pos`, then RANGE would mean rebalancing
records across files, which makes some sense if records were already mostly in
sorted order, while HASH would maintain existing file boundaries. The question
for me is whether we should assume that RANGE indicates it's okay to change
file boundaries. If we had a total ordering before and deleted large chunks of
records, rebalancing would make sense. But if the files/records weren't already
in a total ordering, it could mess up column stats. It isn't clear whether that
assumption holds, so I would say let's defer to the user's explicit request,
since this is the delete mode.
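
To make the suggestion concrete, here is a rough, self-contained sketch of the behavior described above. It is not the actual `SparkWriteConf` code: `DeleteModeSketch`, `Mode`, `resolveDeleteMode`, and `deleteOrdering` are hypothetical names, the table's sort order is modeled as a plain list of column names, and the HASH fallback for an unset mode is an assumption made only for illustration.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-ins; these are not the real Iceberg or SparkWriteConf types.
final class DeleteModeSketch {

  enum Mode {
    NONE, HASH, RANGE;

    static Mode fromName(String name) {
      return valueOf(name.toUpperCase(Locale.ROOT));
    }
  }

  // An explicitly configured mode, including RANGE, is passed through unchanged.
  // The HASH fallback for an unset mode is an assumption of this sketch.
  static Mode resolveDeleteMode(String configuredName) {
    if (configuredName != null) {
      return Mode.fromName(configuredName);
    }
    return Mode.HASH;
  }

  // When the table has no sort order, order by _file and _pos so that rows
  // stay in their original files and in their original order.
  static List<String> deleteOrdering(List<String> tableSortColumns) {
    if (tableSortColumns.isEmpty()) {
      return List.of("_file", "_pos");
    }
    return tableSortColumns;
  }
}
```

With this shape, an explicit `range` setting is honored even for an unpartitioned, unsorted table, and the `_file`/`_pos` ordering decides what the shuffle does to file contents.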
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]