RussellSpitzer commented on a change in pull request #2829:
URL: https://github.com/apache/iceberg/pull/2829#discussion_r705741534
##########
File path:
spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java
##########
@@ -149,7 +168,13 @@ public RewriteDataFiles filter(Expression expression) {
try {
Map<StructLike, List<FileScanTask>> filesByPartition =
Streams.stream(fileScanTasks)
- .collect(Collectors.groupingBy(task -> task.file().partition()));
+ .collect(Collectors.groupingBy(task -> {
+ if (task.file().specId() == table.spec().specId()) {
+ return task.file().partition();
+ } else {
+ return EmptyStruct.get();
Review comment:
The issue is for files written with an old partition spec; if there's another
way of producing an empty struct here, that's fine. I can't remember why I
chose to make a new class, since it was a while ago now.
The core issue is:
Say we originally have a table partitioned by Bucket(x, 5), meaning the
original data files are written with values of x more or less randomly
distributed across files. Then the table's partitioning is changed to
something like Bucket(x, 10). In the worst case, rewriting the files from one
old partition forces us to produce 10 output files for every partition in the
original bucketing. This code says: assume all files created with the old
partitioning are best dealt with at the same time, rather than split up by
their stale partition values.
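The grouping strategy above can be sketched roughly as follows. This is a
minimal standalone illustration, not Iceberg's actual API: `ScanTask`,
`OLD_SPEC_GROUP`, and `group` are hypothetical stand-ins for `FileScanTask`,
`EmptyStruct`, and the `groupingBy` call in the diff.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: files written under the current partition spec are grouped by their
// partition value, while files from any older spec all fall into one shared
// sentinel group, so they are compacted together rather than split apart by
// stale partition values. (Hypothetical types, not Iceberg's API.)
public class GroupBySpecSketch {
  // Sentinel key standing in for EmptyStruct.get() in the real code
  static final Object OLD_SPEC_GROUP = new Object();

  record ScanTask(int specId, String partition) {}

  static Map<Object, List<ScanTask>> group(List<ScanTask> tasks, int currentSpecId) {
    return tasks.stream()
        .collect(Collectors.groupingBy(
            task -> task.specId() == currentSpecId ? task.partition() : OLD_SPEC_GROUP));
  }

  public static void main(String[] args) {
    List<ScanTask> tasks = Arrays.asList(
        new ScanTask(2, "x_bucket=3"),   // current spec
        new ScanTask(2, "x_bucket=7"),   // current spec
        new ScanTask(1, "x_bucket=0"),   // old spec
        new ScanTask(1, "x_bucket=4"));  // old spec

    Map<Object, List<ScanTask>> groups = group(tasks, 2);
    // Two current-spec partition groups plus one combined old-spec group
    System.out.println(groups.size());                      // prints 3
    System.out.println(groups.get(OLD_SPEC_GROUP).size());  // prints 2
  }
}
```

The payoff is that a rewrite of old-spec files sees all of them at once and
can lay the data out under the new spec in a single pass, instead of paying
the worst-case fan-out per old partition described above.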
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]