[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Zoltan Borok-Nagy (Code Review) Wed, 25 Aug 2021 11:55:00 -0700

Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17806 )


Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 2:

(4 comments)

Did a first round, the change looks great!
Thanks for adding so many tests.

http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/hdfs-parquet-table-writer.cc@1723
PS2, Line 1723: col_name
Instead of 'col_name' we could use the Iceberg field_id.

E.g. we retrieve it here as well:
https://github.com/apache/impala/blob/237ed5e8738ec565bc8d3ce813d9b70c12ad4ce7/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1470

It would simplify the code in IcebergCatalogOpExecutor. It's also safer because 
someone might renames a column in parallel.


http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/parquet-column-stats.inline.h
File be/src/exec/parquet/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/parquet-column-stats.inline.h@369
PS2, Line 369: out->assign(data, sizeof(big_endian_val));
The spec says "using the minimum number of bytes for the value", but we are 
writing number of bytes corresponding to the data type.

Iceberg uses BigInteger.toByteArray():
https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/api/src/main/java/org/apache/iceberg/types/Conversions.java#L121

which only writes a single byte for the value 42.
https://onlinegdb.com/i2QS7vqI3

But I'm OK with it if it's OK for Iceberg.


http://gerrit.cloudera.org:8080/#/c/17806/2/common/fbs/IcebergObjects.fbs
File common/fbs/IcebergObjects.fbs:

http://gerrit.cloudera.org:8080/#/c/17806/2/common/fbs/IcebergObjects.fbs@31
PS2, Line 31:   total_compressed_byte_size: FbLongStat;
            :   null_count: FbLongStat;
Can these be just longs? Like 'record_count' in L40?


http://gerrit.cloudera.org:8080/#/c/17806/2/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/17806/2/common/protobuf/control_service.proto@33
PS2, Line 33: // Iceberg per column statistics.
            : message IcebergDmlColumnStatsPB {
            :   // The on disk byte size.
            :   optional int64 column_size = 1;
            :   optional int64 null_count = 2;
            :   // min and max stats are single-value serialized.
            :   optional string min_binary = 3;
            :   optional string max_binary = 4;
            : }
I wonder if we could create the Iceberg flat buffers at the data sinks again. 
Like it was in https://gerrit.cloudera.org/#/c/16545/

That way we wouldn't need to duplicate the stats struct in protobuf. But I'm OK 
with the current solution if changing it would bring more complexity elsewhere.



--
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <[email protected]>
Gerrit-Reviewer: Attila Jeges <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 25 Aug 2021 18:54:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Reply via email to