[ 
https://issues.apache.org/jira/browse/IMPALA-10879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410497#comment-17410497
 ] 

ASF subversion and git services commented on IMPALA-10879:
----------------------------------------------------------

Commit c8aa5796d93510723342055cc70cf8d00abae754 in impala's branch 
refs/heads/master from Attila Jeges
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c8aa579 ]

IMPALA-10879: Add parquet stats to iceberg manifest

This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-Nan values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Reviewed-on: http://gerrit.cloudera.org:8080/17806
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Attila Jeges <[email protected]>


> Add parquet stats to iceberg manifest
> -------------------------------------
>
>                 Key: IMPALA-10879
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10879
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend, Frontend
>    Affects Versions: Impala 4.0.0
>            Reporter: Attila Jeges
>            Assignee: Attila Jeges
>            Priority: Major
>              Labels: impala-iceberg
>
> Parquet stats should be written to iceberg manifest as per-datafile metrics.
> This task is specifically about the following metrics:
> - column_sizes : Map from column id to the total size on disk of all regions 
> that store the column. Does not include bytes necessary to read other 
> columns, like footers. Leave null for row-oriented formats
> - null_value_counts : Map from column id to number of null values in the 
> column.
> - lower_bounds : Map from column id to lower bound in the column serialized 
> as binary. Each value must be less than or equal to all non-null, non-NaN 
> values in the column for the file.
> - upper_bounds : Map from column id to upper bound in the column serialized 
> as binary. Each value must be greater than or equal to all non-null, non-Nan 
> values in the column for the file.
> Iceberg manifest doc: 
> https://iceberg.apache.org/spec/#manifests
> lower_bounds and upper_bounds values should be Single-value serialized to 
> binary:
> https://iceberg.apache.org/spec/#appendix-d-single-value-serialization



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to