[
https://issues.apache.org/jira/browse/IMPALA-10879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410497#comment-17410497
]
ASF subversion and git services commented on IMPALA-10879:
----------------------------------------------------------
Commit c8aa5796d93510723342055cc70cf8d00abae754 in impala's branch
refs/heads/master from Attila Jeges
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c8aa579 ]
IMPALA-10879: Add parquet stats to iceberg manifest
This patch adds Parquet stats to the Iceberg manifest as per-datafile
metrics.
The following metrics are supported:
- column_sizes :
Map from column id to the total size on disk of all regions that
store the column. Does not include bytes necessary to read other
columns, like footers.
- null_value_counts :
Map from column id to number of null values in the column.
- lower_bounds :
Map from column id to lower bound in the column serialized as
binary. Each value must be less than or equal to all non-null,
non-NaN values in the column for the file.
- upper_bounds :
Map from column id to upper bound in the column serialized as
binary. Each value must be greater than or equal to all non-null,
non-NaN values in the column for the file.
The corresponding Parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).
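
As a rough illustration of how per-column-chunk stats could roll up into the four per-datafile maps listed above, here is a minimal Python sketch. It is not Impala's C++ implementation: the 'ColumnChunkStats' and 'DataFileMetrics' types and their fields are hypothetical, values are restricted to integers, and the bounds are kept unserialized (the manifest stores them as binary).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ColumnChunkStats:
    """Stats for one column chunk of one row group (hypothetical type)."""
    column_id: int
    compressed_bytes: int
    null_count: int
    min_value: Optional[int]  # None when the chunk holds only nulls
    max_value: Optional[int]


@dataclass
class DataFileMetrics:
    """Per-datafile metrics keyed by column id (hypothetical type)."""
    column_sizes: Dict[int, int] = field(default_factory=dict)
    null_value_counts: Dict[int, int] = field(default_factory=dict)
    lower_bounds: Dict[int, int] = field(default_factory=dict)
    upper_bounds: Dict[int, int] = field(default_factory=dict)

    def add(self, s: ColumnChunkStats) -> None:
        cid = s.column_id
        # Sizes and null counts are summed over all chunks of the column.
        self.column_sizes[cid] = self.column_sizes.get(cid, 0) + s.compressed_bytes
        self.null_value_counts[cid] = self.null_value_counts.get(cid, 0) + s.null_count
        # Bounds are the min of the mins and the max of the maxes.
        if s.min_value is not None:
            cur = self.lower_bounds.get(cid)
            self.lower_bounds[cid] = s.min_value if cur is None else min(cur, s.min_value)
        if s.max_value is not None:
            cur = self.upper_bounds.get(cid)
            self.upper_bounds[cid] = s.max_value if cur is None else max(cur, s.max_value)


metrics = DataFileMetrics()
metrics.add(ColumnChunkStats(column_id=1, compressed_bytes=120, null_count=0, min_value=5, max_value=40))
metrics.add(ColumnChunkStats(column_id=1, compressed_bytes=80, null_count=3, min_value=-2, max_value=17))
metrics.add(ColumnChunkStats(column_id=2, compressed_bytes=64, null_count=10, min_value=None, max_value=None))
```

In this sketch a column that contains only nulls (column id 2) gets a size and a null count but no entry in either bounds map.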
Testing:
- New e2e test was added to verify that the metrics are written to the
Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
metrics are used to prune data files when querying Iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- New BE test was added for single-value serialization.
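
The pruning that the second e2e test exercises can be pictured with a small Python sketch. It only shows the general idea for an equality predicate on one column and assumes the bounds have already been deserialized; the 'DataFile' type and the function names are hypothetical and do not correspond to Impala or Iceberg APIs.

```python
from typing import Dict, List, NamedTuple, Optional


class DataFile(NamedTuple):
    """A data file with deserialized per-column bounds (hypothetical type)."""
    path: str
    lower_bounds: Dict[int, int]
    upper_bounds: Dict[int, int]


def may_contain(value: int, lower: Optional[int], upper: Optional[int]) -> bool:
    """Return False only when the bounds prove that `value` cannot be in the file."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    # A missing bound gives no information, so the file must be kept.
    return True


def files_to_scan(files: List[DataFile], column_id: int, value: int) -> List[str]:
    """Keep the files that may hold rows where column `column_id` equals `value`."""
    kept = []
    for f in files:
        lower = f.lower_bounds.get(column_id)
        upper = f.upper_bounds.get(column_id)
        if may_contain(value, lower, upper):
            kept.append(f.path)
    return kept


files = [
    DataFile("a.parq", {1: 1}, {1: 10}),
    DataFile("b.parq", {1: 20}, {1: 30}),
    DataFile("c.parq", {}, {}),  # no metrics written for this file
]
to_scan = files_to_scan(files, column_id=1, value=25)
```

Pruning of this kind is conservative: a file is skipped only when its bounds rule the value out, so files without metrics are still scanned.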
Relevant Iceberg documentation:
- Manifest:
https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
serialized to binary:
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
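
For orientation, the following Python sketch shows what single-value serialization looks like for a handful of primitive types, following the encodings described in the Iceberg spec appendix linked above (little-endian fixed-width numbers, UTF-8 strings without a length prefix). The function name and the string type tags are hypothetical, other types (decimal, date, timestamp, and so on) are deliberately left out, and this is not the C++ code added by the patch.

```python
import struct


def serialize_single_value(type_name: str, value) -> bytes:
    """Serialize one bound value to binary for a few primitive types (sketch)."""
    if type_name == "boolean":
        # 0x00 for false, a non-zero byte for true.
        return b"\x01" if value else b"\x00"
    if type_name == "int":
        return struct.pack("<i", value)  # 4-byte little-endian
    if type_name == "long":
        return struct.pack("<q", value)  # 8-byte little-endian
    if type_name == "float":
        return struct.pack("<f", value)  # 4-byte little-endian IEEE 754
    if type_name == "double":
        return struct.pack("<d", value)  # 8-byte little-endian IEEE 754
    if type_name == "string":
        return value.encode("utf-8")     # UTF-8 bytes, no length prefix
    raise ValueError(f"type not covered by this sketch: {type_name}")


lower = serialize_single_value("int", 1)
upper = serialize_single_value("string", "abc")
```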
Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Reviewed-on: http://gerrit.cloudera.org:8080/17806
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Attila Jeges <[email protected]>
> Add parquet stats to iceberg manifest
> -------------------------------------
>
> Key: IMPALA-10879
> URL: https://issues.apache.org/jira/browse/IMPALA-10879
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend, Frontend
> Affects Versions: Impala 4.0.0
> Reporter: Attila Jeges
> Assignee: Attila Jeges
> Priority: Major
> Labels: impala-iceberg
>
> Parquet stats should be written to iceberg manifest as per-datafile metrics.
> This task is specifically about the following metrics:
> - column_sizes : Map from column id to the total size on disk of all regions
> that store the column. Does not include bytes necessary to read other
> columns, like footers. Leave null for row-oriented formats.
> - null_value_counts : Map from column id to number of null values in the
> column.
> - lower_bounds : Map from column id to lower bound in the column serialized
> as binary. Each value must be less than or equal to all non-null, non-NaN
> values in the column for the file.
> - upper_bounds : Map from column id to upper bound in the column serialized
> as binary. Each value must be greater than or equal to all non-null, non-NaN
> values in the column for the file.
> Iceberg manifest doc:
> https://iceberg.apache.org/spec/#manifests
> lower_bounds and upper_bounds values should be Single-value serialized to
> binary:
> https://iceberg.apache.org/spec/#appendix-d-single-value-serialization