yihua commented on a change in pull request #5181:
URL: https://github.com/apache/hudi/pull/5181#discussion_r840042665



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##########
@@ -1508,7 +1508,7 @@ public boolean isMetadataBloomFilterIndexEnabled() {
   }
 
   public boolean isMetadataColumnStatsIndexEnabled() {
-    return isMetadataTableEnabled() && getMetadataConfig().isColumnStatsIndexEnabled();
+    return isMetadataTableEnabled() && getMetadataConfig().isColumnStatsIndexEnabled() && getMetadataConfig().isColumnStatsIndexEnabled();

Review comment:
       nit: is this redundant?

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/columnstats/ColumnStatsIndexHelper.java
##########
@@ -63,9 +63,10 @@
 import java.util.stream.Collectors;
 import java.util.stream.StreamSupport;
 
+// TODO merge w/ ColumnStatsIndexSupport
 public class ColumnStatsIndexHelper {
 
-  private static final String COLUMN_STATS_INDEX_FILE_COLUMN_NAME = "file";
+  private static final String COLUMN_STATS_INDEX_FILE_COLUMN_NAME = "fileName";

Review comment:
       I assume this is only used on the read side in the intermediate dataset representation, not coupled with the metadata payload schema.

##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -121,16 +121,192 @@
                             "doc": "Minimum value in the range. Based on user 
data table schema, we can convert this to appropriate type",
                             "name": "minValue",
                             "type": [
+                                // These types should be aligned with the Parquet `Statistics` impl,
+                                // making sure that we implement semantics consistent across file formats
+                                //
+                                // NOTE: Other logical types (decimal, date, timestamp, etc.) will be converted
+                                //       into one of the following types, making sure that their corresponding
+                                //       ordering is preserved
                                 "null",
-                                "string"
+                                {

Review comment:
       should we keep `"string"` for compatibility across SNAPSHOT releases?
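If keeping the old branch is the way to go, one backward-compatible option (a hypothetical sketch, not the schema actually proposed in this PR) would be to retain the legacy `"string"` branch alongside the new wrapper records, so that readers resolving records written with the old schema still find a matching union branch:

```json
"type": [
    "null",
    "string",
    "org.apache.hudi.avro.model.BooleanWrapper",
    "org.apache.hudi.avro.model.IntWrapper"
]
```

Per Avro's schema-resolution rules, a value written under the old union can only be read if the reader schema still contains a matching branch, which is the root of the compatibility concern here.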

##########
File path: hudi-common/src/main/avro/HoodieMetadata.avsc
##########
@@ -121,16 +121,192 @@
                             "doc": "Minimum value in the range. Based on user 
data table schema, we can convert this to appropriate type",
                             "name": "minValue",
                             "type": [
+                                // These types should be aligned with the Parquet `Statistics` impl,
+                                // making sure that we implement semantics consistent across file formats
+                                //
+                                // NOTE: Other logical types (decimal, date, timestamp, etc.) will be converted
+                                //       into one of the following types, making sure that their corresponding
+                                //       ordering is preserved
                                 "null",
-                                "string"
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "BooleanWrapper",
+                                    "doc": "A record wrapping boolean type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "boolean",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "IntWrapper",
+                                    "doc": "A record wrapping int type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "int",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "LongWrapper",
+                                    "doc": "A record wrapping long type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "long",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "FloatWrapper",
+                                    "doc": "A record wrapping float type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "float",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "DoubleWrapper",
+                                    "doc": "A record wrapping double type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "double",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "BytesWrapper",
+                                    "doc": "A record wrapping bytes type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "bytes",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "StringWrapper",
+                                    "doc": "A record wrapping string type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": "string",
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "DateWrapper",
+                                    "doc": "A record wrapping Date logical type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": {
+                                                "type": "int"
+                                                // NOTE: Due to breaking changes in code-gen b/w Avro 1.8.2 and 1.10, we can't
+                                                //       rely on logical types to do proper encoding of the native Java types,
+                                                //       and hence have to encode statistics manually
+                                                //"logicalType": "date"
+                                            },
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "DecimalWrapper",
+                                    "doc": "A record wrapping Decimal logical type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": {
+                                                "type": "bytes",
+                                                "logicalType": "decimal",
+                                                // NOTE: This is equivalent to Spark's [[DoubleDecimal]] and should
+                                                //       be enough for almost any possible use-case
+                                                "precision": 30,
+                                                "scale": 15
+                                            },
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "TimeMicrosWrapper",
+                                    "doc": "A record wrapping Time-micros logical type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": {
+                                                "type": "long",
+                                                "logicalType": "time-micros"
+                                            },
+                                            "name": "value"
+                                        }
+                                    ]
+                                },
+                                {
+                                    "namespace": "org.apache.hudi.avro.model",
+                                    "type": "record",
+                                    "name": "TimestampMicrosWrapper",
+                                    "doc": "A record wrapping Timestamp-micros logical type so that it can be used w/in Avro's Union",
+                                    "fields": [
+                                        {
+                                            "type": {
+                                                "type": "long"
+                                                // NOTE: Due to breaking changes in code-gen b/w Avro 1.8.2 and 1.10, we can't
+                                                //       rely on logical types to do proper encoding of the native Java types,
+                                                //       and hence have to encode statistics manually
+                                                //"logicalType": "timestamp-micros"
+                                            },
+                                            "name": "value"
+                                        }
+                                    ]
+                                }
                             ]
                         },
                         {
                             "doc": "Maximum value in the range. Based on user 
data table schema, we can convert it to appropriate type",
                             "name": "maxValue",
                             "type": [
+                                // These types should be aligned with the Parquet `Statistics` impl,
+                                // making sure that we implement semantics consistent across file formats
+                                //
+                                // NOTE: Other logical types (decimal, date, timestamp, etc.) will be converted
+                                //       into one of the following types, making sure that their corresponding
+                                //       ordering is preserved
                                 "null",
-                                "string"
+                                "org.apache.hudi.avro.model.BooleanWrapper",

Review comment:
       Similar question here.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -1126,23 +1127,29 @@ public static void aggregateColumnStats(IndexedRecord record, List<Schema.Field>
     }
 
     fields.forEach(field -> {
-      Map<String, Object> columnStats = columnToStats.getOrDefault(field.name(), new HashMap<>());
-      final String fieldVal = getNestedFieldValAsString((GenericRecord) record, field.name(), true, consistentLogicalTimestampEnabled);
+      Map<String, Object> columnStats = columnToStats.get(field.name());
+      GenericRecord genericRecord = (GenericRecord) record;
+      final Object fieldVal = convertValueForSpecificDataTypes(field.schema(), genericRecord.get(field.name()), consistentLogicalTimestampEnabled);
+      final Schema fieldSchema = getNestedFieldSchemaFromWriteSchema(genericRecord.getSchema(), field.name());
       // update stats
-      final int fieldSize = fieldVal == null ? 0 : fieldVal.length();
-      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()) + fieldSize);
-      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()) + fieldSize);
+      // NOTE: Unlike Parquet, Avro does not give the field size.
+      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()));
+      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()));

Review comment:
       I assume the size here is not treated as accurate, so it's okay not to add the field size here.

##########
File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##########
@@ -1126,23 +1127,29 @@ public static void aggregateColumnStats(IndexedRecord record, List<Schema.Field>
     }
 
     fields.forEach(field -> {
-      Map<String, Object> columnStats = columnToStats.getOrDefault(field.name(), new HashMap<>());
-      final String fieldVal = getNestedFieldValAsString((GenericRecord) record, field.name(), true, consistentLogicalTimestampEnabled);
+      Map<String, Object> columnStats = columnToStats.get(field.name());
+      GenericRecord genericRecord = (GenericRecord) record;
+      final Object fieldVal = convertValueForSpecificDataTypes(field.schema(), genericRecord.get(field.name()), consistentLogicalTimestampEnabled);
+      final Schema fieldSchema = getNestedFieldSchemaFromWriteSchema(genericRecord.getSchema(), field.name());
       // update stats
-      final int fieldSize = fieldVal == null ? 0 : fieldVal.length();
-      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()) + fieldSize);
-      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()) + fieldSize);
+      // NOTE: Unlike Parquet, Avro does not give the field size.
+      columnStats.put(TOTAL_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_SIZE, 0).toString()));
+      columnStats.put(TOTAL_UNCOMPRESSED_SIZE, Long.parseLong(columnStats.getOrDefault(TOTAL_UNCOMPRESSED_SIZE, 0).toString()));
 
-      if (!isNullOrEmpty(fieldVal)) {
+      if (fieldVal != null) {
         // set the min value of the field
         if (!columnStats.containsKey(MIN)) {
           columnStats.put(MIN, fieldVal);
         }
-        if (fieldVal.compareTo(String.valueOf(columnStats.get(MIN))) < 0) {
+        if (compare(fieldVal, columnStats.get(MIN), fieldSchema) < 0) {
           columnStats.put(MIN, fieldVal);
         }
         // set the max value of the field
-        if (fieldVal.compareTo(String.valueOf(columnStats.getOrDefault(MAX, ""))) > 0) {
+        if (!columnStats.containsKey(MAX)) {
+          columnStats.put(MAX, fieldVal);
+        }
+        // update the max value of the field
+        if (compare(fieldVal, columnStats.get(MAX), fieldSchema) > 0) {

Review comment:
       So basically, for struct and nested fields, e.g., "a.b.c" and "a.b.d", the logic always constructs column stats for the individual nested fields "a.b.c" and "a.b.d", which have a defined order based on their data types, and not for the struct fields "a" and "a.b", which do not have a pre-defined order. Is that the case?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
