[ 
https://issues.apache.org/jira/browse/IMPALA-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081744#comment-18081744
 ] 

Csaba Ringhofer edited comment on IMPALA-15004 at 5/18/26 4:04 PM:
-------------------------------------------------------------------

I am not sure that writing Theta sketches is the best way. If we already need 
to write other stats to a custom blob, then we could also write Impala's 
battle-tested HLL arrays.
So I see two ways:
1. write standard Iceberg Theta sketches for NDV + an Impala-specific one with 
the other stats
2. write an Impala-specific sketch

The benefit of 1 is interoperability, but this also means that we must ensure 
that the sketches Impala writes are compatible with other engines for every 
data type.
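
To make the two options concrete, here is a rough sketch of which blobs would 
be written per column in each case. It assumes the "type" and "fields" keys 
of Puffin blob metadata and the Theta blob type name from the Iceberg Puffin 
spec; the Impala blob type name, the "carries" key and the plan_blobs() helper 
are hypothetical and only serve as illustration.

```python
# Blob type for NDV sketches as named in the Iceberg Puffin spec.
THETA_BLOB_TYPE = "apache-datasketches-theta-v1"
# Hypothetical name for a custom Impala blob; no such type exists today.
IMPALA_BLOB_TYPE = "impala-column-stats-v1"


def plan_blobs(field_id, option):
    """Return the blobs to write for one column under option 1 or option 2.

    "type" and "fields" mirror Puffin blob metadata keys; "carries" is an
    illustrative annotation, not part of the Puffin format.
    """
    if option == 1:
        # Standard Theta sketch for NDV plus a custom blob for everything else.
        return [
            {"type": THETA_BLOB_TYPE, "fields": [field_id], "carries": ["ndv"]},
            {"type": IMPALA_BLOB_TYPE, "fields": [field_id],
             "carries": ["other stats"]},
        ]
    if option == 2:
        # One custom blob that holds the HLL array and the other stats.
        return [
            {"type": IMPALA_BLOB_TYPE, "fields": [field_id],
             "carries": ["ndv (HLL)", "other stats"]},
        ]
    raise ValueError("option must be 1 or 2")
```

Under option 1 other engines can read the first blob and skip the second; 
under option 2 only Impala can interpret what was written.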

Another thing that makes 1 harder is that Impala's DataSketches implementation 
does not support all Impala types: 
https://github.com/apache/impala/blob/e1ca23d627532bb17228e3d455c55a03b3e28f49/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java#L429
Currently only TINYINT/INT/BIGINT/FLOAT/DOUBLE are supported.
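
As a minimal sketch of what this limitation means for a stats writer, the 
columns of a table could be split into those that can get a Theta sketch and 
those that cannot. The type list is the one given above; the helper name and 
the input shape (column name -> type name) are assumptions for illustration 
and are not Impala code.

```python
# Types listed above as currently supported by Impala's DataSketches functions.
THETA_SUPPORTED_TYPES = {"TINYINT", "INT", "BIGINT", "FLOAT", "DOUBLE"}


def split_columns_by_theta_support(columns):
    """Split columns into (supported, unsupported) lists of column names.

    columns: dict mapping column name -> Impala type name (case-insensitive).
    Input order is preserved within each list.
    """
    supported, unsupported = [], []
    for name, type_name in columns.items():
        if type_name.upper() in THETA_SUPPORTED_TYPES:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported
```

For example, a table with columns id BIGINT, name STRING and price DOUBLE 
would get Theta sketches for id and price only, so name would need either a 
fallback or extended type support.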


> Puffin stats writer for Iceberg tables
> --------------------------------------
>
>                 Key: IMPALA-15004
>                 URL: https://issues.apache.org/jira/browse/IMPALA-15004
>             Project: IMPALA
>          Issue Type: New Feature
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Mihaly Szjatinya
>            Priority: Major
>              Labels: impala-iceberg, impala-iceberg-active-backlog
>
> Currently COMPUTE STATS stores column statistics only in HMS.
> Iceberg has Puffin files for this purpose, but currently there's only a 
> single blob type (Apache DataSketches Theta sketches) that we can store, and 
> it only supports NDV.
> Impala should comply with Iceberg's standards and write Puffin files. The 
> stats that cannot be stored in well-known Iceberg Puffin blob types could be 
> stored in custom Impala blobs. That way all statistics information could be 
> retrieved from a single place.
