[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #3983: Spark: Spark3 ZOrder Rewrite Strategy

GitBox Fri, 18 Mar 2022 08:48:24 -0700


RussellSpitzer commented on a change in pull request #3983:
URL: https://github.com/apache/iceberg/pull/3983#discussion_r830128555




##########
File path: spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -994,6 +1000,83 @@ public void testCommitStateUnknownException() {
     shouldHaveSnapshots(table, 2); // Commit actually Succeeded
   }
 
+  @Test
+  public void testZOrderSort() {
+    int originalFiles = 20;
+    Table table = createTable(originalFiles);
+    shouldHaveLastCommitUnsorted(table, "c2");
+    shouldHaveFiles(table, originalFiles);
+
+    List<Object[]> originalData = currentData();
+    double originalFilesC2 = percentFilesRequired(table, "c2", "foo23");
+    double originalFilesC3 = percentFilesRequired(table, "c3", "bar21");
+    double originalFilesC2C3 = percentFilesRequired(table, new String[]{"c2", "c3"}, new String[]{"foo23", "bar23"});
+
+    Assert.assertTrue("Should require all files to scan c2", originalFilesC2 > 0.99);
+    Assert.assertTrue("Should require all files to scan c3", originalFilesC3 > 0.99);
+
+    RewriteDataFiles.Result result =
+        basicRewrite(table)
+            .zOrder("c2", "c3")
+            .option(SortStrategy.MAX_FILE_SIZE_BYTES, Integer.toString((averageFileSize(table) / 2) + 2))
+            // Divide files in 2
+            .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Integer.toString(averageFileSize(table) / 2))
+            .option(SortStrategy.MIN_INPUT_FILES, "1")
+            .execute();
+
+    Assert.assertEquals("Should have 1 fileGroups", 1, result.rewriteResults().size());
+    int zOrderedFilesTotal = Iterables.size(table.currentSnapshot().addedFiles());
+    Assert.assertTrue("Should have written 40+ files", zOrderedFilesTotal >= 40);

Review comment:
       Tiny changes in Parquet compression and its ability to determine 
file size make the exact file count slightly iffy between versions. I have 
spent a lot of time trying to fine-tune the parameters here so that we make 
exactly the number of files we want, but it's difficult unless the files are 
big enough for this test to take a considerable amount of time. :/
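
One way to express the point above in a test is to assert a range around the expected file count instead of an exact number. The following is only a rough sketch, not code from this PR or from Iceberg: the class `FileCountTolerance`, both of its methods, and the 10% tolerance are hypothetical names and values picked for illustration.

```java
// Hypothetical helper, not part of Iceberg or of this PR.
final class FileCountTolerance {
  private FileCountTolerance() {
  }

  // Estimated number of output files if every file were exactly the target size
  // (ceiling division). Real writers only approximate the target size.
  static long expectedFileCount(long totalBytes, long targetFileSizeBytes) {
    if (totalBytes < 0 || targetFileSizeBytes <= 0) {
      throw new IllegalArgumentException("Sizes must be non-negative and target must be positive");
    }
    return (totalBytes + targetFileSizeBytes - 1) / targetFileSizeBytes;
  }

  // True when the actual count is within the given relative tolerance of the expected count.
  // The tolerance (for example 0.10 for 10%) is an assumed value, to be tuned per test.
  static boolean withinTolerance(long actual, long expected, double tolerance) {
    long allowedDrift = (long) Math.ceil(expected * tolerance);
    return Math.abs(actual - expected) <= allowedDrift;
  }

  public static void main(String[] args) {
    // 20 input files of roughly 1000 bytes each, rewritten with a 500-byte target size.
    long expected = expectedFileCount(20 * 1000L, 500L);
    System.out.println("expected files: " + expected);
    System.out.println("41 files accepted: " + withinTolerance(41, expected, 0.10));
    System.out.println("50 files accepted: " + withinTolerance(50, expected, 0.10));
  }
}
```

Compared with a lower bound such as `zOrderedFilesTotal >= 40`, a range check like this would also catch a rewrite that produces far too many files, at the cost of one more parameter to tune.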




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


