[GitHub] [iceberg] yyanyy commented on a change in pull request #1641: Add NaN counter to Metrics and implement in Parquet writers

GitBox Wed, 11 Nov 2020 16:50:09 -0800


yyanyy commented on a change in pull request #1641:
URL: https://github.com/apache/iceberg/pull/1641#discussion_r521743760




##########
File path: core/src/test/java/org/apache/iceberg/TestMetrics.java
##########
@@ -347,14 +327,120 @@ public void testMetricsForNullColumns() throws 
IOException {
     Record secondRecord = GenericRecord.create(schema);
     secondRecord.setField("intCol", null);
 
-    InputFile recordsFile = writeRecords(schema, firstRecord, secondRecord);
-
-    Metrics metrics = getMetrics(recordsFile);
+    Metrics metrics = getMetrics(schema, firstRecord, secondRecord);
     Assert.assertEquals(2L, (long) metrics.recordCount());
     assertCounts(1, 2L, 2L, metrics);
     assertBounds(1, IntegerType.get(), null, null, metrics);
   }
 
+  @Test
+  public void testMetricsForNaNColumns() throws IOException {
+    Record firstRecord = GenericRecord.create(FLOAT_DOUBLE_ONLY_SCHEMA);
+    firstRecord.setField("floatCol", Float.NaN);
+    firstRecord.setField("doubleCol", Double.NaN);
+    Record secondRecord = GenericRecord.create(FLOAT_DOUBLE_ONLY_SCHEMA);
+    secondRecord.setField("floatCol", Float.NaN);
+    secondRecord.setField("doubleCol", Double.NaN);
+
+    Metrics metrics = getMetrics(FLOAT_DOUBLE_ONLY_SCHEMA, firstRecord, 
secondRecord);
+    Assert.assertEquals(2L, (long) metrics.recordCount());
+    assertCounts(1, 2L, 0L, 2L, metrics);
+    assertCounts(2, 2L, 0L, 2L, metrics);
+    // below: current behavior; will be null once NaN is excluded from 
upper/lower bound
+    assertBounds(1, FloatType.get(), Float.NaN, Float.NaN, metrics);
+    assertBounds(2, DoubleType.get(), Double.NaN, Double.NaN, metrics);
+  }
+
+  @Test
+  public void testColumnBoundsWithNaNValueAtFront() throws IOException {
+    Record nonNaNRecord1 = GenericRecord.create(FLOAT_DOUBLE_ONLY_SCHEMA);
+    nonNaNRecord1.setField("floatCol", 1.2F);
+    nonNaNRecord1.setField("doubleCol", 3.4D);
+
+    Record nonNaNRecord2 = GenericRecord.create(FLOAT_DOUBLE_ONLY_SCHEMA);
+    nonNaNRecord2.setField("floatCol", 5.6F);
+    nonNaNRecord2.setField("doubleCol", 7.8D);
+
+    Record nanRecord = GenericRecord.create(FLOAT_DOUBLE_ONLY_SCHEMA);
+    nanRecord.setField("floatCol", Float.NaN);
+    nanRecord.setField("doubleCol", Double.NaN);
+
+    Metrics metrics = getMetrics(FLOAT_DOUBLE_ONLY_SCHEMA, nanRecord, 
nonNaNRecord1, nonNaNRecord2);
+    Assert.assertEquals(3L, (long) metrics.recordCount());
+    assertCounts(1, 3L, 0L, 1L, metrics);
+    assertCounts(2, 3L, 0L, 1L, metrics);
+
+    // below: current behavior; will be non-NaN values once NaN is excluded 
from upper/lower bound. ORC and Parquet's
+    // behaviors differ due to their implementation of comparison being 
different.
+    if (fileFormat() == FileFormat.ORC) {
+      assertBounds(1, FloatType.get(), Float.NaN, Float.NaN, metrics);
+      assertBounds(2, DoubleType.get(), Double.NaN, Double.NaN, metrics);

Review comment:
       I guess this means that today, we may skip including an ORC file for 
predicates that utilize bounds when the column to be evaluated contains non-NaN 
data but both upper and lower bound is NaN, which happens when the field of the 
first record in the file is NaN. Is my understanding correct? I can create an 
issue about this. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] yyanyy commented on a change in pull request #1641: Add NaN counter to Metrics and implement in Parquet writers

Reply via email to