RussellSpitzer commented on a change in pull request #3983:
URL: https://github.com/apache/iceberg/pull/3983#discussion_r830128555
##########
File path:
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
##########
@@ -994,6 +1000,83 @@ public void testCommitStateUnknownException() {
shouldHaveSnapshots(table, 2); // Commit actually Succeeded
}
+ @Test
+ public void testZOrderSort() {
+ int originalFiles = 20;
+ Table table = createTable(originalFiles);
+ shouldHaveLastCommitUnsorted(table, "c2");
+ shouldHaveFiles(table, originalFiles);
+
+ List<Object[]> originalData = currentData();
+ double originalFilesC2 = percentFilesRequired(table, "c2", "foo23");
+ double originalFilesC3 = percentFilesRequired(table, "c3", "bar21");
+ double originalFilesC2C3 = percentFilesRequired(table, new String[]{"c2", "c3"}, new String[]{"foo23", "bar23"});
+
+ Assert.assertTrue("Should require all files to scan c2", originalFilesC2 > 0.99);
+ Assert.assertTrue("Should require all files to scan c3", originalFilesC3 > 0.99);
+
+ RewriteDataFiles.Result result =
+ basicRewrite(table)
+ .zOrder("c2", "c3")
+ .option(SortStrategy.MAX_FILE_SIZE_BYTES, Integer.toString((averageFileSize(table) / 2) + 2))
+ // Divide files in 2
+ .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES, Integer.toString(averageFileSize(table) / 2))
+ .option(SortStrategy.MIN_INPUT_FILES, "1")
+ .execute();
+
+ Assert.assertEquals("Should have 1 fileGroups", 1, result.rewriteResults().size());
+ int zOrderedFilesTotal = Iterables.size(table.currentSnapshot().addedFiles());
+ Assert.assertTrue("Should have written 40+ files", zOrderedFilesTotal >= 40);
Review comment:
Tiny changes in Parquet compression, and in its ability to determine
file size, make the exact file count slightly iffy between versions. I have
spent a lot of time trying to fine-tune the parameters here so that we produce
exactly the number of files we want, but that is difficult unless the files are
big enough that this test takes a considerable amount of time. :/
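
One way to keep an assertion like this stable when the file count drifts a
little between versions is to check a band around the target instead of an
exact value. The following is only a minimal sketch: the `FileCountTolerance`
class, the `withinTolerance` helper, and the 10% band are hypothetical and are
not part of this PR or of Iceberg's test utilities.

```java
// Hypothetical helper for illustration only; not an Iceberg API.
final class FileCountTolerance {
  private FileCountTolerance() {
  }

  /**
   * Returns true when {@code actual} is within {@code tolerancePercent} percent
   * of {@code expected}. The allowed slack is rounded up to a whole file.
   */
  static boolean withinTolerance(int actual, int expected, int tolerancePercent) {
    if (expected <= 0 || tolerancePercent < 0) {
      throw new IllegalArgumentException("expected must be > 0 and tolerancePercent >= 0");
    }
    // Integer ceiling of expected * tolerancePercent / 100, avoiding floating point.
    long slack = ((long) expected * tolerancePercent + 99) / 100;
    return actual >= expected - slack && actual <= expected + slack;
  }

  public static void main(String[] args) {
    int expected = 40; // target file count used in the test above
    for (int actual : new int[] {35, 36, 40, 44, 45}) {
      System.out.println(actual + " files within 10% of " + expected + ": "
          + withinTolerance(actual, expected, 10));
    }
  }
}
```

In the test this could replace the exact-count style of check with something
like `Assert.assertTrue(..., withinTolerance(zOrderedFilesTotal, 40, 10))`. The
width of the band is a judgment call: too wide and the assertion stops
verifying that files were actually split.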
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]