ebyhr commented on code in PR #14617:
URL: https://github.com/apache/iceberg/pull/14617#discussion_r2540778041


##########
parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java:
##########
@@ -264,6 +268,51 @@ public void testColumnStatisticsEnabled() throws Exception {
     }
   }
 
+  @Test
+  public void testFooterMetricsWithNameMappingForFileWithoutIds() throws IOException {
+    Schema schemaWithIds =
+        new Schema(
+            required(1, "id", Types.LongType.get()), optional(2, "data", Types.StringType.get()));
+
+    NameMapping nameMapping = MappingUtil.create(schemaWithIds);
+
+    File file = createTempFile(temp);
+
+    // Write a Parquet file WITHOUT field IDs using plain Avro schema
+    org.apache.avro.Schema avroSchemaWithoutIds =
+        org.apache.avro.SchemaBuilder.record("test")
+            .fields()
+            .requiredLong("id")
+            .optionalString("data")
+            .endRecord();
+
+    ParquetWriter<GenericData.Record> writer =
+        AvroParquetWriter.<GenericData.Record>builder(new org.apache.hadoop.fs.Path(file.toURI()))
+            .withDataModel(GenericData.get())
+            .withSchema(avroSchemaWithoutIds)
+            .build();
+
+    GenericData.Record record = new GenericData.Record(avroSchemaWithoutIds);
+    record.put("id", 1L);
+    record.put("data", "a");
+    writer.write(record);
+    writer.close();
+
+    InputFile inputFile = Files.localInput(file);
+
+    try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(inputFile))) {
+      MessageType parquetSchema = reader.getFooter().getFileMetaData().getSchema();
+      assertThat(ParquetSchemaUtil.hasIds(parquetSchema)).isFalse();
+
+      Metrics metrics =
+          ParquetUtil.footerMetrics(
+              reader.getFooter(), Stream.empty(), MetricsConfig.getDefault(), nameMapping);
+
+      // The key assertion: column sizes should be keyed by field IDs from NameMapping
+      assertThat(metrics.columnSizes()).containsKeys(1, 2);

Review Comment:
   This assertion passes even when `metrics.columnSizes()` contains additional keys. Could you use `containsOnlyKeys` instead?
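For context, the difference between the two assertions is that `containsKeys` only checks that the expected keys are present (extra keys are allowed), while `containsOnlyKeys` also fails when the map has keys beyond the expected set. A minimal stdlib-only sketch of the two semantics (the `KeyAssertionDemo` class and its helper methods are illustrative stand-ins, not AssertJ itself):

```java
import java.util.Map;
import java.util.Set;

public class KeyAssertionDemo {
    // containsKeys-style check: passes if every expected key is present;
    // extra keys in the map do not cause a failure.
    static boolean containsKeys(Map<Integer, Long> m, Set<Integer> expected) {
        return m.keySet().containsAll(expected);
    }

    // containsOnlyKeys-style check: passes only if the map's key set is
    // exactly the expected set -- no missing keys, no extras.
    static boolean containsOnlyKeys(Map<Integer, Long> m, Set<Integer> expected) {
        return m.keySet().equals(expected);
    }

    public static void main(String[] args) {
        // Map with an unexpected extra key (3) alongside the expected 1 and 2.
        Map<Integer, Long> columnSizes = Map.of(1, 10L, 2, 20L, 3, 30L);
        Set<Integer> expected = Set.of(1, 2);

        System.out.println(containsKeys(columnSizes, expected));     // true: extras tolerated
        System.out.println(containsOnlyKeys(columnSizes, expected)); // false: extra key 3 rejected
    }
}
```

This is why the stricter assertion catches a regression where spurious column IDs leak into the metrics, while the looser one would silently pass.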



##########
parquet/src/test/java/org/apache/iceberg/parquet/TestParquet.java:
##########
@@ -264,6 +268,51 @@ public void testColumnStatisticsEnabled() throws Exception {
     }
   }
 
+  @Test
+  public void testFooterMetricsWithNameMappingForFileWithoutIds() throws IOException {
+    Schema schemaWithIds =
+        new Schema(
+            required(1, "id", Types.LongType.get()), optional(2, "data", Types.StringType.get()));
+
+    NameMapping nameMapping = MappingUtil.create(schemaWithIds);
+
+    File file = createTempFile(temp);
+
+    // Write a Parquet file WITHOUT field IDs using plain Avro schema
+    org.apache.avro.Schema avroSchemaWithoutIds =
+        org.apache.avro.SchemaBuilder.record("test")
+            .fields()
+            .requiredLong("id")
+            .optionalString("data")
+            .endRecord();
+
+    ParquetWriter<GenericData.Record> writer =
+        AvroParquetWriter.<GenericData.Record>builder(new org.apache.hadoop.fs.Path(file.toURI()))

Review Comment:
   `builder(Path file)` is deprecated. Can we use `builder(OutputFile file)` instead? You can refer to #14620
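A hedged sketch of the suggested change, assuming the non-deprecated `AvroParquetWriter.builder(OutputFile)` overload fed by `HadoopOutputFile.fromPath` from `parquet-hadoop` (a fragment for illustration, not verified against the PR branch; requires the Parquet libraries on the classpath):

```java
// Sketch: replace the deprecated builder(Path) with builder(OutputFile),
// wrapping the Hadoop Path in a HadoopOutputFile.
ParquetWriter<GenericData.Record> writer =
    AvroParquetWriter.<GenericData.Record>builder(
            org.apache.parquet.hadoop.util.HadoopOutputFile.fromPath(
                new org.apache.hadoop.fs.Path(file.toURI()),
                new org.apache.hadoop.conf.Configuration()))
        .withDataModel(GenericData.get())
        .withSchema(avroSchemaWithoutIds)
        .build();
```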



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
