RussellSpitzer commented on code in PR #4637:
URL: https://github.com/apache/iceberg/pull/4637#discussion_r865155248
##########
core/src/test/java/org/apache/iceberg/TestMetadataTableScans.java:
##########
@@ -550,6 +550,119 @@ public void testDeleteFilesTableSelection() throws
IOException {
Assert.assertEquals(expected, scan.schema().asStruct());
}
+ @Test
+ public void testPartitionSpecEvolutionAdditive() {
+ preparePartitionedTable();
+
+ // Change spec and add two data files
+ table.updateSpec()
+ .addField("id")
+ .commit();
+ PartitionSpec newSpec = table.spec();
+
+ // Add two data files with new spec
+ PartitionKey data10Key = new PartitionKey(newSpec, table.schema());
+ data10Key.set(0, 0); // data=0
+ data10Key.set(1, 10); // id=10
+ DataFile data10 = DataFiles.builder(newSpec)
+ .withPath("/path/to/data-10.parquet")
+ .withRecordCount(10)
+ .withFileSizeInBytes(10)
+ .withPartition(data10Key)
+ .build();
+ PartitionKey data11Key = new PartitionKey(newSpec, table.schema());
+ data11Key.set(0, 1); // data=0
+ data10Key.set(1, 11); // id=11
+ DataFile data11 = DataFiles.builder(newSpec)
+ .withPath("/path/to/data-11.parquet")
+ .withRecordCount(10)
+ .withFileSizeInBytes(10)
+ .withPartition(data11Key)
+ .build();
+
+ table.newFastAppend().appendFile(data10).commit();
+ table.newFastAppend().appendFile(data11).commit();
+
+ Table metadataTable = new PartitionsTable(table.ops(), table);
+ Expression filter = Expressions.and(
+ Expressions.equal("partition.id", 10),
+ Expressions.greaterThan("record_count", 0));
+ TableScan scan = metadataTable.newScan().filter(filter);
+ CloseableIterable<FileScanTask> tasks =
PartitionsTable.planFiles((StaticTableScan) scan);
+
+ // Four data files of old spec, one new data file of new spec
+ Assert.assertEquals(5, Iterables.size(tasks));
+
+ filter = Expressions.and(
+ Expressions.equal("partition.data_bucket", 0),
+ Expressions.greaterThan("record_count", 0));
+ scan = metadataTable.newScan().filter(filter);
+ tasks = PartitionsTable.planFiles((StaticTableScan) scan);
+
+ // 1 original data file written by old spec, plus 1 new data file written
by new spec
+ Assert.assertEquals(2, Iterables.size(tasks));
+ }
+
+ @Test
+ public void testPartitionSpecEvolutionRemoval() {
+ preparePartitionedTable();
+
+ // Remove partition field
+ table.updateSpec()
+ .removeField(Expressions.bucket("data", 16))
+ .addField("id")
+ .commit();
+ PartitionSpec newSpec = table.spec();
+
+ // Add two data files with new spec
+ // Partition Fields are replaced in V1 with void and actually removed in V2
+ int partIndex = (formatVersion == 1) ? 1 : 0;
+ PartitionKey data10Key = new PartitionKey(newSpec, table.schema());
+ data10Key.set(partIndex, 10);
+ DataFile data10 = DataFiles.builder(newSpec)
+ .withPath("/path/to/data-10.parquet")
+ .withRecordCount(10)
+ .withFileSizeInBytes(10)
+ .withPartition(data10Key)
+ .build();
+ PartitionKey data11Key = new PartitionKey(newSpec, table.schema());
+ data11Key.set(partIndex, 11);
+ DataFile data11 = DataFiles.builder(newSpec)
+ .withPath("/path/to/data-11.parquet")
+ .withRecordCount(10)
+ .withFileSizeInBytes(10)
+ .withPartition(data11Key)
+ .build();
+
+ table.newFastAppend().appendFile(data10).commit();
+ table.newFastAppend().appendFile(data11).commit();
+
+ Table metadataTable = new PartitionsTable(table.ops(), table);
+ Expression filter = Expressions.and(
+ Expressions.equal("partition.id", 10),
+ Expressions.greaterThan("record_count", 0));
+ TableScan scan = metadataTable.newScan().filter(filter);
+ CloseableIterable<FileScanTask> tasks =
PartitionsTable.planFiles((StaticTableScan) scan);
+
+ // Four original files of original spec, one data file written by new spec
+ Assert.assertEquals(5, Iterables.size(tasks));
+
+ filter = Expressions.and(
+ Expressions.equal("partition.data_bucket", 0),
+ Expressions.greaterThan("record_count", 0));
+ scan = metadataTable.newScan().filter(filter);
+ tasks = PartitionsTable.planFiles((StaticTableScan) scan);
+
+ if (formatVersion == 1) {
+ // 1 original data file written by old spec
+ Assert.assertEquals(1, Iterables.size(tasks));
+ } else {
+ // V2 drops the partition field so it is not used the planning, though
data is still filtered out later
+ // 1 original data/delete files written by old spec, plus both of new
data file/delete file written by new spec
+ Assert.assertEquals(3, Iterables.size(tasks));
Review Comment:
I was just wondering if we need some kind of special filter: if you have a
predicate on a column not present in the spec, just return cannot-match.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]