RussellSpitzer commented on a change in pull request #2829:
URL: https://github.com/apache/iceberg/pull/2829#discussion_r705741534
##########
File path:
spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java
##########
@@ -149,7 +168,13 @@ public RewriteDataFiles filter(Expression expression) {
try {
Map<StructLike, List<FileScanTask>> filesByPartition =
Streams.stream(fileScanTasks)
- .collect(Collectors.groupingBy(task -> task.file().partition()));
+ .collect(Collectors.groupingBy(task -> {
+ if (task.file().specId() == table.spec().specId()) {
+ return task.file().partition();
+ } else {
+ return EmptyStruct.get();
Review comment:
The issue is for files written with an old partition spec; if there's another
way of producing an empty struct here, that's fine. I can't remember why I
chose to make a new class, since it was a while ago now.
The core issue is:
Say we originally have a table partitioned by Bucket(x, 5), meaning the
original data files are written with values of x more or less randomly
distributed across files. Then the table's partitioning is changed to
something like Bucket(x, 10). In the worst case, rewriting the files from one
old partition forces us to produce 10 output files for every partition in the
original bucketing. This code says: assume all files created with the old
partitioning are best dealt with at the same time, rather than split up by
their stale partition values.
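The grouping strategy above can be sketched roughly as follows. This is a
minimal standalone illustration, not Iceberg's actual API: `ScanTask`,
`OLD_SPEC_GROUP`, and `group` are hypothetical stand-ins for `FileScanTask`,
`EmptyStruct`, and the `groupingBy` call in the diff.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: files written under the current partition spec are grouped by their
// partition value, while files from any older spec all fall into one shared
// sentinel group, so they are compacted together rather than split apart by
// stale partition values. (Hypothetical types, not Iceberg's API.)
public class GroupBySpecSketch {
  // Sentinel key standing in for EmptyStruct.get() in the real code
  static final Object OLD_SPEC_GROUP = new Object();

  record ScanTask(int specId, String partition) {}

  static Map<Object, List<ScanTask>> group(List<ScanTask> tasks, int currentSpecId) {
    return tasks.stream()
        .collect(Collectors.groupingBy(
            task -> task.specId() == currentSpecId ? task.partition() : OLD_SPEC_GROUP));
  }

  public static void main(String[] args) {
    List<ScanTask> tasks = Arrays.asList(
        new ScanTask(2, "x_bucket=3"),   // current spec
        new ScanTask(2, "x_bucket=7"),   // current spec
        new ScanTask(1, "x_bucket=0"),   // old spec
        new ScanTask(1, "x_bucket=4"));  // old spec

    Map<Object, List<ScanTask>> groups = group(tasks, 2);
    // Two current-spec partition groups plus one combined old-spec group
    System.out.println(groups.size());                      // prints 3
    System.out.println(groups.get(OLD_SPEC_GROUP).size());  // prints 2
  }
}
```

The payoff is that a rewrite of old-spec files sees all of them at once and
can lay the data out under the new spec in a single pass, instead of paying
the worst-case fan-out per old partition described above.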
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]