alexeykudinkin commented on a change in pull request #5077:
URL: https://github.com/apache/hudi/pull/5077#discussion_r836991954
##########
File path:
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndex.scala
##########
@@ -63,6 +69,103 @@ class TestColumnStatsIndex extends HoodieClientTestBase {
cleanupSparkContexts()
}
+ @Test
+ def testMetadataColumnStatsIndex(): Unit = {
+ setTableName("hoodie_test")
+ initMetaClient()
+ val sourceJSONTablePath =
getClass.getClassLoader.getResource("index/zorder/input-table-json").toString
Review comment:
Let's remove the `z-order` dangling references
##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -501,6 +502,109 @@ public static Object getNestedFieldVal(GenericRecord
record, String fieldName, b
}
}
+ /**
+ * Get schema for the given field and record. Field can be nested, denoted
by dot notation. e.g: a.b.c
+ *
+ * @param record - record containing the value of the given field
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromRecord(GenericRecord record,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ GenericRecord valueNode = record;
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Object val = valueNode.get(part);
+
+ if (i == parts.length - 1) {
+ return
resolveNullableSchema(valueNode.getSchema().getField(part).schema());
+ } else {
+ if (!(val instanceof GenericRecord)) {
+ throw new HoodieException("Cannot find a record at part value :" +
part);
+ }
+ valueNode = (GenericRecord) val;
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+
+ /**
+ * Get schema for the given field and write schema. Field can be nested,
denoted by dot notation. e.g: a.b.c
+ * Use this method when record is not available. Otherwise, prefer to use
{@link #getNestedFieldSchemaFromRecord(GenericRecord, String)}
+ *
+ * @param writeSchema - write schema of the record
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromWriteSchema(Schema writeSchema,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Schema schema = writeSchema.getField(part).schema();
+
+ if (i == parts.length - 1) {
+ return resolveNullableSchema(schema);
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+ /**
+ * Given a field schema, convert its value to native Java type.
+ *
+ * @param schema - field schema
+ * @param val - field value
+ * @return
+ */
+ public static Comparable<?> convertToNativeJavaType(Schema schema, Object
val) {
+ if (val == null) {
+ return StringUtils.EMPTY_STRING;
Review comment:
Should return null
##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -501,6 +502,109 @@ public static Object getNestedFieldVal(GenericRecord
record, String fieldName, b
}
}
+ /**
+ * Get schema for the given field and record. Field can be nested, denoted
by dot notation. e.g: a.b.c
+ *
+ * @param record - record containing the value of the given field
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromRecord(GenericRecord record,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ GenericRecord valueNode = record;
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Object val = valueNode.get(part);
+
+ if (i == parts.length - 1) {
+ return
resolveNullableSchema(valueNode.getSchema().getField(part).schema());
+ } else {
+ if (!(val instanceof GenericRecord)) {
+ throw new HoodieException("Cannot find a record at part value :" +
part);
+ }
+ valueNode = (GenericRecord) val;
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+
+ /**
+ * Get schema for the given field and write schema. Field can be nested,
denoted by dot notation. e.g: a.b.c
+ * Use this method when record is not available. Otherwise, prefer to use
{@link #getNestedFieldSchemaFromRecord(GenericRecord, String)}
+ *
+ * @param writeSchema - write schema of the record
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromWriteSchema(Schema writeSchema,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Schema schema = writeSchema.getField(part).schema();
+
+ if (i == parts.length - 1) {
+ return resolveNullableSchema(schema);
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+ /**
+ * Given a field schema, convert its value to native Java type.
+ *
+ * @param schema - field schema
+ * @param val - field value
+ * @return
+ */
+ public static Comparable<?> convertToNativeJavaType(Schema schema, Object
val) {
+ if (val == null) {
+ return StringUtils.EMPTY_STRING;
+ }
+ if (schema.getLogicalType() == LogicalTypes.date()) {
+ return java.sql.Date.valueOf((val.toString()));
+ }
+ switch (schema.getType()) {
+ case UNION:
Review comment:
We need to do union-unfolding externally to this method (otherwise every
branch will have to do it)
##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -511,6 +512,132 @@ public static Object getNestedFieldVal(GenericRecord
record, String fieldName, b
}
}
+ /**
+ * Get schema for the given field and record. Field can be nested, denoted
by dot notation. e.g: a.b.c
+ *
+ * @param record - record containing the value of the given field
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromRecord(GenericRecord record,
String fieldName) {
Review comment:
This doesn't seem to be used anymore
##########
File path:
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndex.scala
##########
@@ -63,6 +69,103 @@ class TestColumnStatsIndex extends HoodieClientTestBase {
cleanupSparkContexts()
}
+ @Test
+ def testMetadataColumnStatsIndex(): Unit = {
+ setTableName("hoodie_test")
+ initMetaClient()
+ val sourceJSONTablePath =
getClass.getClassLoader.getResource("index/zorder/input-table-json").toString
Review comment:
Actually, NVM. Will do in my cleanup PR
##########
File path:
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -1061,23 +1064,29 @@ public static void aggregateColumnStats(IndexedRecord
record, Schema schema,
}
schema.getFields().forEach(field -> {
- Map<String, Object> columnStats =
columnToStats.getOrDefault(field.name(), new HashMap<>());
- final String fieldVal = getNestedFieldValAsString((GenericRecord)
record, field.name(), true, consistentLogicalTimestampEnabled);
+ Map<String, Object> columnStats = columnToStats.get(field.name());
+ GenericRecord genericRecord = (GenericRecord) record;
+ final Object fieldVal = convertValueForSpecificDataTypes(field.schema(),
genericRecord.get(field.name()), consistentLogicalTimestampEnabled);
+ final Schema fieldSchema =
getNestedFieldSchemaFromWriteSchema(genericRecord.getSchema(), field.name());
// update stats
- final int fieldSize = fieldVal == null ? 0 : fieldVal.length();
- columnStats.put(TOTAL_SIZE,
Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()) + fieldSize);
- columnStats.put(TOTAL_UNCOMPRESSED_SIZE,
Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString())
+ fieldSize);
+ // NOTE: Unlike Parquet, Avro does not give the field size.
+ columnStats.put(TOTAL_SIZE,
Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()));
+ columnStats.put(TOTAL_UNCOMPRESSED_SIZE,
Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE,
0).toString()));
- if (!isNullOrEmpty(fieldVal)) {
+ if (fieldVal != null) {
// set the min value of the field
if (!columnStats.containsKey(MIN)) {
columnStats.put(MIN, fieldVal);
}
- if (fieldVal.compareTo(String.valueOf(columnStats.get(MIN))) < 0) {
+ if (compare(fieldVal, columnStats.get(MIN), fieldSchema) < 0) {
columnStats.put(MIN, fieldVal);
}
// set the max value of the field
- if (fieldVal.compareTo(String.valueOf(columnStats.getOrDefault(MAX,
""))) > 0) {
+ if (!columnStats.containsKey(MAX)) {
+ columnStats.put(MAX, fieldVal);
+ }
+ // set the max value of the field
+ if (compare(fieldVal, columnStats.get(MAX), fieldSchema) > 0) {
Review comment:
This could also be `else if`
##########
File path:
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndex.scala
##########
@@ -63,6 +69,91 @@ class TestColumnStatsIndex extends HoodieClientTestBase {
cleanupSparkContexts()
}
+ @Test
+ def testMetadataColumnStatsIndex(): Unit = {
Review comment:
Somehow can't resolve my own comment. This is taken care of
##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -501,6 +502,109 @@ public static Object getNestedFieldVal(GenericRecord
record, String fieldName, b
}
}
+ /**
+ * Get schema for the given field and record. Field can be nested, denoted
by dot notation. e.g: a.b.c
+ *
+ * @param record - record containing the value of the given field
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromRecord(GenericRecord record,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ GenericRecord valueNode = record;
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Object val = valueNode.get(part);
+
+ if (i == parts.length - 1) {
+ return
resolveNullableSchema(valueNode.getSchema().getField(part).schema());
+ } else {
+ if (!(val instanceof GenericRecord)) {
+ throw new HoodieException("Cannot find a record at part value :" +
part);
+ }
+ valueNode = (GenericRecord) val;
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+
+ /**
+ * Get schema for the given field and write schema. Field can be nested,
denoted by dot notation. e.g: a.b.c
+ * Use this method when record is not available. Otherwise, prefer to use
{@link #getNestedFieldSchemaFromRecord(GenericRecord, String)}
+ *
+ * @param writeSchema - write schema of the record
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromWriteSchema(Schema writeSchema,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Schema schema = writeSchema.getField(part).schema();
+
+ if (i == parts.length - 1) {
+ return resolveNullableSchema(schema);
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+ /**
+ * Given a field schema, convert its value to native Java type.
+ *
+ * @param schema - field schema
+ * @param val - field value
+ * @return
+ */
+ public static Comparable<?> convertToNativeJavaType(Schema schema, Object
val) {
Review comment:
Should accept `Comparable`
##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -501,6 +502,109 @@ public static Object getNestedFieldVal(GenericRecord
record, String fieldName, b
}
}
+ /**
+ * Get schema for the given field and record. Field can be nested, denoted
by dot notation. e.g: a.b.c
+ *
+ * @param record - record containing the value of the given field
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromRecord(GenericRecord record,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ GenericRecord valueNode = record;
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Object val = valueNode.get(part);
+
+ if (i == parts.length - 1) {
+ return
resolveNullableSchema(valueNode.getSchema().getField(part).schema());
+ } else {
+ if (!(val instanceof GenericRecord)) {
+ throw new HoodieException("Cannot find a record at part value :" +
part);
+ }
+ valueNode = (GenericRecord) val;
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+
+ /**
+ * Get schema for the given field and write schema. Field can be nested,
denoted by dot notation. e.g: a.b.c
+ * Use this method when record is not available. Otherwise, prefer to use
{@link #getNestedFieldSchemaFromRecord(GenericRecord, String)}
+ *
+ * @param writeSchema - write schema of the record
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromWriteSchema(Schema writeSchema,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Schema schema = writeSchema.getField(part).schema();
+
+ if (i == parts.length - 1) {
+ return resolveNullableSchema(schema);
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+ /**
+ * Given a field schema, convert its value to native Java type.
+ *
+ * @param schema - field schema
+ * @param val - field value
+ * @return
+ */
+ public static Comparable<?> convertToNativeJavaType(Schema schema, Object
val) {
+ if (val == null) {
+ return StringUtils.EMPTY_STRING;
+ }
+ if (schema.getLogicalType() == LogicalTypes.date()) {
+ return java.sql.Date.valueOf((val.toString()));
+ }
+ switch (schema.getType()) {
+ case UNION:
+ return convertToNativeJavaType(resolveNullableSchema(schema), val);
+ case STRING:
+ return val.toString();
+ case BYTES:
+ return (ByteBuffer) val;
+ case INT:
+ return (Integer) val;
+ case LONG:
+ return (Long) val;
+ case FLOAT:
+ return (Float) val;
+ case DOUBLE:
+ return (Double) val;
+ case BOOLEAN:
+ return (Boolean) val;
+ case ENUM:
+ case MAP:
+ case FIXED:
+ case NULL:
+ case RECORD:
+ case ARRAY:
+ return null;
+ default:
+ throw new IllegalStateException("Unexpected type: " +
schema.getType());
+ }
+ }
+
+ /**
+ * Type-aware object comparison. Used to compare two objects for an Avro
field.
+ */
+ public static int compare(Object o1, Object o2, Schema schema) {
+ if (Schema.Type.MAP.equals(schema.getType())) {
Review comment:
We don't need this conditional
##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -501,6 +502,109 @@ public static Object getNestedFieldVal(GenericRecord
record, String fieldName, b
}
}
+ /**
+ * Get schema for the given field and record. Field can be nested, denoted
by dot notation. e.g: a.b.c
+ *
+ * @param record - record containing the value of the given field
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromRecord(GenericRecord record,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ GenericRecord valueNode = record;
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Object val = valueNode.get(part);
+
+ if (i == parts.length - 1) {
+ return
resolveNullableSchema(valueNode.getSchema().getField(part).schema());
+ } else {
+ if (!(val instanceof GenericRecord)) {
+ throw new HoodieException("Cannot find a record at part value :" +
part);
+ }
+ valueNode = (GenericRecord) val;
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+
+ /**
+ * Get schema for the given field and write schema. Field can be nested,
denoted by dot notation. e.g: a.b.c
+ * Use this method when record is not available. Otherwise, prefer to use
{@link #getNestedFieldSchemaFromRecord(GenericRecord, String)}
+ *
+ * @param writeSchema - write schema of the record
+ * @param fieldName - name of the field
+ * @return
+ */
+ public static Schema getNestedFieldSchemaFromWriteSchema(Schema writeSchema,
String fieldName) {
+ String[] parts = fieldName.split("\\.");
+ int i = 0;
+ for (; i < parts.length; i++) {
+ String part = parts[i];
+ Schema schema = writeSchema.getField(part).schema();
+
+ if (i == parts.length - 1) {
+ return resolveNullableSchema(schema);
+ }
+ }
+ throw new HoodieException("Failed to get schema. Not a valid field name: "
+ fieldName);
+ }
+
+ /**
+ * Given a field schema, convert its value to native Java type.
+ *
+ * @param schema - field schema
+ * @param val - field value
+ * @return
+ */
+ public static Comparable<?> convertToNativeJavaType(Schema schema, Object
val) {
+ if (val == null) {
+ return StringUtils.EMPTY_STRING;
+ }
+ if (schema.getLogicalType() == LogicalTypes.date()) {
+ return java.sql.Date.valueOf((val.toString()));
+ }
+ switch (schema.getType()) {
+ case UNION:
+ return convertToNativeJavaType(resolveNullableSchema(schema), val);
+ case STRING:
+ return val.toString();
+ case BYTES:
+ return (ByteBuffer) val;
+ case INT:
+ return (Integer) val;
+ case LONG:
+ return (Long) val;
+ case FLOAT:
+ return (Float) val;
+ case DOUBLE:
+ return (Double) val;
+ case BOOLEAN:
+ return (Boolean) val;
+ case ENUM:
+ case MAP:
+ case FIXED:
+ case NULL:
+ case RECORD:
+ case ARRAY:
+ return null;
+ default:
+ throw new IllegalStateException("Unexpected type: " +
schema.getType());
+ }
+ }
+
+ /**
+ * Type-aware object comparison. Used to compare two objects for an Avro
field.
+ */
+ public static int compare(Object o1, Object o2, Schema schema) {
Review comment:
Should also accept `Comparable`
##########
File path:
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -1061,23 +1064,29 @@ public static void aggregateColumnStats(IndexedRecord
record, Schema schema,
}
schema.getFields().forEach(field -> {
- Map<String, Object> columnStats =
columnToStats.getOrDefault(field.name(), new HashMap<>());
- final String fieldVal = getNestedFieldValAsString((GenericRecord)
record, field.name(), true, consistentLogicalTimestampEnabled);
+ Map<String, Object> columnStats = columnToStats.get(field.name());
+ GenericRecord genericRecord = (GenericRecord) record;
+ final Object fieldVal = convertValueForSpecificDataTypes(field.schema(),
genericRecord.get(field.name()), consistentLogicalTimestampEnabled);
+ final Schema fieldSchema =
getNestedFieldSchemaFromWriteSchema(genericRecord.getSchema(), field.name());
// update stats
- final int fieldSize = fieldVal == null ? 0 : fieldVal.length();
- columnStats.put(TOTAL_SIZE,
Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()) + fieldSize);
- columnStats.put(TOTAL_UNCOMPRESSED_SIZE,
Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString())
+ fieldSize);
+ // NOTE: Unlike Parquet, Avro does not give the field size.
+ columnStats.put(TOTAL_SIZE,
Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()));
+ columnStats.put(TOTAL_UNCOMPRESSED_SIZE,
Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE,
0).toString()));
- if (!isNullOrEmpty(fieldVal)) {
+ if (fieldVal != null) {
// set the min value of the field
if (!columnStats.containsKey(MIN)) {
columnStats.put(MIN, fieldVal);
}
- if (fieldVal.compareTo(String.valueOf(columnStats.get(MIN))) < 0) {
+ if (compare(fieldVal, columnStats.get(MIN), fieldSchema) < 0) {
Review comment:
Let's make this `else if`
##########
File path:
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -1032,11 +1035,11 @@ public static void accumulateColumnRanges(Schema.Field
field, String filePath,
Map<String,
HoodieColumnRangeMetadata<Comparable>> columnRangeMap,
Review comment:
Let's move `accumulateColumnRanges` closer to where it's actually used
(into `HoodieAppendHandle`)
##########
File path:
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java
##########
@@ -2026,11 +2027,57 @@ private void validateMetadata(SparkRDDWriteClient
testClient) throws IOException
assertTrue(latestSlices.size()
<= (numFileVersions *
metadataEnabledPartitionTypes.get(partition).getFileGroupCount()), "Should
limit file slice to "
+ numFileVersions + " per file group, but was " +
latestSlices.size());
+ List<HoodieLogFile> logFiles =
latestSlices.get(0).getLogFiles().collect(Collectors.toList());
+ try {
+ if (MetadataPartitionType.FILES.getPartitionPath().equals(partition)) {
+ verifyMetadataRawRecords(table, logFiles, false);
+ }
+ if
(MetadataPartitionType.COLUMN_STATS.getPartitionPath().equals(partition)) {
+ verifyMetadataColumnStatsRecords(logFiles);
+ }
+ } catch (IOException e) {
+ LOG.error("Metadata record validation failed", e);
+ fail("Metadata record validation failed");
+ }
});
LOG.info("Validation time=" + timer.endTimer());
}
+ private void verifyMetadataColumnStatsRecords(List<HoodieLogFile> logFiles)
throws IOException {
+ for (HoodieLogFile logFile : logFiles) {
+ FileStatus[] fsStatus = fs.listStatus(logFile.getPath());
+ MessageType writerSchemaMsg =
TableSchemaResolver.readSchemaFromLogFile(fs, logFile.getPath());
+ if (writerSchemaMsg == null) {
+ // not a data block
+ continue;
+ }
+
+ Schema writerSchema = new AvroSchemaConverter().convert(writerSchemaMsg);
+ HoodieLogFormat.Reader logFileReader = HoodieLogFormat.newReader(fs, new
HoodieLogFile(fsStatus[0].getPath()), writerSchema);
+
+ while (logFileReader.hasNext()) {
+ HoodieLogBlock logBlock = logFileReader.next();
+ if (logBlock instanceof HoodieDataBlock) {
+ try (ClosableIterator<IndexedRecord> recordItr = ((HoodieDataBlock)
logBlock).getRecordItr()) {
+ recordItr.forEachRemaining(indexRecord -> {
+ final GenericRecord record = (GenericRecord) indexRecord;
+ final GenericRecord colStatsRecord = (GenericRecord)
record.get(HoodieMetadataPayload.SCHEMA_FIELD_ID_COLUMN_STATS);
+ assertNotNull(colStatsRecord);
+
assertNotNull(colStatsRecord.get(HoodieMetadataPayload.COLUMN_STATS_FIELD_COLUMN_NAME));
+
assertNotNull(colStatsRecord.get(HoodieMetadataPayload.COLUMN_STATS_FIELD_NULL_COUNT));
+ /**
+ * TODO: some types of field may have null min/max as these
statistics are only supported for primitive types
Review comment:
The comment is rather misleading: min/max stats could be null, but they are
supported for composite types as well (a composite payload would be converted
to a byte string and compared as such)
##########
File path:
hudi-spark-datasource/hudi-spark/src/test/resources/index/zorder/update-column-stats-index-table.json
##########
@@ -0,0 +1,26 @@
+{"minValue":"0","maxValue":"959","nullCount":0,"columnName":"c1"}
+{"minValue":"64.768","maxValue":"979.272","nullCount":0,"columnName":"c3"}
+{"minValue":"20220328202039002","maxValue":"20220328202039002","nullCount":0,"columnName":"_hoodie_commit_time"}
+{"minValue":"20220328202039002_0_41","maxValue":"20220328202039002_0_80","nullCount":0,"columnName":"_hoodie_commit_seqno"}
+{"minValue":"20220328202022669_0_1","maxValue":"20220328202022669_0_9","nullCount":0,"columnName":"_hoodie_commit_seqno"}
+{"minValue":"2020-01-01","maxValue":"2020-11-21","nullCount":0,"columnName":"c6"}
+{"minValue":"2020-01-01","maxValue":"2020-11-22","nullCount":0,"columnName":"c6"}
+{"minValue":" 111sdc","maxValue":" 8sdc","nullCount":0,"columnName":"c2"}
+{"minValue":"1637307284159000","maxValue":"1637307284201000","nullCount":0,"columnName":"c4"}
+{"minValue":"10deb9bc-f7b0-4c5c-8cd4-eccb92788c8c-0_0-70-115_20220328202039002.parquet","maxValue":"10deb9bc-f7b0-4c5c-8cd4-eccb92788c8c-0_0-70-115_20220328202039002.parquet","nullCount":0,"columnName":"_hoodie_file_name"}
+{"minValue":"19.000","maxValue":"994.355","nullCount":0,"columnName":"c3"}
+{"minValue":" 0sdc","maxValue":" 959sdc","nullCount":0,"columnName":"c2"}
+{"minValue":"8","maxValue":"770","nullCount":0,"columnName":"c1"}
+{"minValue":"b8dc14a9-c067-4b9c-9d29-eaa09bd35c4b-0_0-23-37_20220328202022669.parquet","maxValue":"b8dc14a9-c067-4b9c-9d29-eaa09bd35c4b-0_0-23-37_20220328202022669.parquet","nullCount":0,"columnName":"_hoodie_file_name"}
+{"minValue":"20220328202022669","maxValue":"20220328202022669","nullCount":0,"columnName":"_hoodie_commit_time"}
+{"minValue":"1637383255339000","maxValue":"1637383255550000","nullCount":0,"columnName":"c4"}
+{"minValue":"1","maxValue":"97","nullCount":0,"columnName":"c5"}
+{"minValue":"0","maxValue":"959","nullCount":0,"columnName":"_hoodie_record_key"}
+{"minValue":"9","maxValue":"9","nullCount":0,"columnName":"c8"}
+{"minValue":"9","maxValue":"9","nullCount":0,"columnName":"c8"}
+{"minValue":"","maxValue":"","nullCount":0,"columnName":"_hoodie_partition_path"}
+{"minValue":"111","maxValue":"8","nullCount":0,"columnName":"_hoodie_record_key"}
+{"minValue":"2","maxValue":"78","nullCount":0,"columnName":"c5"}
+{"minValue":"","maxValue":"","nullCount":0,"columnName":"_hoodie_partition_path"}
+{"minValue":"java.nio.HeapByteBuffer[pos=0 lim=1
cap=1]","maxValue":"java.nio.HeapByteBuffer[pos=0 lim=1
cap=1]","nullCount":0,"columnName":"c7"}
Review comment:
How does `"minValue":"java.nio.HeapByteBuffer[pos=0 lim=1 cap=1]"` pass
the assertion test?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]