Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )
Change subject: IMPALA-10879: Add parquet stats to iceberg manifest ...................................................................... Patch Set 2: (4 comments) Did a first round, the change looks great! Thanks for adding so many tests. http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/hdfs-parquet-table-writer.cc@1723 PS2, Line 1723: col_name Instead of 'col_name' we could use the Iceberg field_id. E.g. we retrieve it here as well: https://github.com/apache/impala/blob/237ed5e8738ec565bc8d3ce813d9b70c12ad4ce7/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1470 It would simplify the code in IcebergCatalogOpExecutor. It's also safer because someone might renames a column in parallel. http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/parquet-column-stats.inline.h File be/src/exec/parquet/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/parquet-column-stats.inline.h@369 PS2, Line 369: out->assign(data, sizeof(big_endian_val)); The spec says "using the minimum number of bytes for the value", but we are writing number of bytes corresponding to the data type. Iceberg uses BigInteger.toByteArray(): https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/api/src/main/java/org/apache/iceberg/types/Conversions.java#L121 which only writes a single byte for the value 42. https://onlinegdb.com/i2QS7vqI3 But I'm OK with it if it's OK for Iceberg. http://gerrit.cloudera.org:8080/#/c/17806/2/common/fbs/IcebergObjects.fbs File common/fbs/IcebergObjects.fbs: http://gerrit.cloudera.org:8080/#/c/17806/2/common/fbs/IcebergObjects.fbs@31 PS2, Line 31: total_compressed_byte_size: FbLongStat; : null_count: FbLongStat; Can these be just longs? Like 'record_count' in L40? http://gerrit.cloudera.org:8080/#/c/17806/2/common/protobuf/control_service.proto File common/protobuf/control_service.proto: http://gerrit.cloudera.org:8080/#/c/17806/2/common/protobuf/control_service.proto@33 PS2, Line 33: // Iceberg per column statistics. : message IcebergDmlColumnStatsPB { : // The on disk byte size. : optional int64 column_size = 1; : optional int64 null_count = 2; : // min and max stats are single-value serialized. : optional string min_binary = 3; : optional string max_binary = 4; : } I wonder if we could create the Iceberg flat buffers at the data sinks again. Like it was in https://gerrit.cloudera.org/#/c/16545/ That way we wouldn't need to duplicate the stats struct in protobuf. But I'm OK with the current solution if changing it would bring more complexity elsewhere. -- To view, visit http://gerrit.cloudera.org:8080/17806 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b Gerrit-Change-Number: 17806 Gerrit-PatchSet: 2 Gerrit-Owner: Attila Jeges <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 25 Aug 2021 18:54:50 +0000 Gerrit-HasComments: Yes
