[ 
https://issues.apache.org/jira/browse/IMPALA-15106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-15106:
-------------------------------------
    Description: 
The existing theta sketch functions don't work for all types; for example, there 
is no support for TIMESTAMP. To use Iceberg's Puffin files to store incremental 
NDV stats, we need to support all types currently handled by COMPUTE STATS. The 
Iceberg spec clearly defines how to do this for all types:
https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
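
As a rough illustration of what the linked specs ask for: as I read the Puffin 
spec, each distinct value is converted to bytes with Iceberg's single-value 
serialization before it is fed to the theta sketch. The sketch below covers a 
few primitive types only. The function names are hypothetical and are not 
Impala or Iceberg APIs; only the byte layouts are taken from Appendix D.

```python
import struct
from datetime import date

# Hypothetical helper names for illustration; only the byte layouts follow
# the Iceberg spec's binary single-value serialization (Appendix D).
EPOCH = date(1970, 1, 1)


def serialize_int(value: int) -> bytes:
    """int: 4-byte little-endian."""
    return struct.pack("<i", value)


def serialize_long(value: int) -> bytes:
    """long: 8-byte little-endian."""
    return struct.pack("<q", value)


def serialize_double(value: float) -> bytes:
    """double: 8-byte little-endian IEEE 754."""
    return struct.pack("<d", value)


def serialize_date(value: date) -> bytes:
    """date: days since 1970-01-01 as a 4-byte little-endian int."""
    return struct.pack("<i", (value - EPOCH).days)


def serialize_string(value: str) -> bytes:
    """string: UTF-8 bytes, no length prefix."""
    return value.encode("utf-8")
```

The resulting bytes, not the native value, would be what gets hashed into the 
sketch, so that sketches written by different engines stay comparable.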

Note that we may be unable to support nanosecond timestamps; an error could 
probably be returned in that case.
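
One possible reading of the note above, as a minimal sketch: Iceberg's 
timestamp type is serialized as microseconds since the epoch in an 8-byte 
little-endian long, so a value with sub-microsecond precision cannot be 
represented exactly. The function name and the choice to raise an error 
(rather than truncate) are assumptions for illustration, not the planned 
Impala behavior.

```python
import struct

NANOS_PER_MICRO = 1_000


def serialize_timestamp_from_nanos(nanos_since_epoch: int) -> bytes:
    """Serialize a timestamp as microseconds since epoch (8-byte little-endian).

    Hypothetical helper: rejects values that carry sub-microsecond precision
    instead of silently truncating them.
    """
    micros, remainder = divmod(nanos_since_epoch, NANOS_PER_MICRO)
    if remainder != 0:
        raise ValueError(
            "timestamp has sub-microsecond precision and cannot be "
            "serialized as microseconds without loss"
        )
    return struct.pack("<q", micros)
```

Truncating instead of raising would also be possible, but two timestamps that 
differ only in nanoseconds would then count as one distinct value.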

In the frontend, theta functions are added here:
https://github.com/apache/impala/blob/0b8294b1a5b0ad4a817dee13b7fbb2ee53f534e2/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java#L1171

  was:
Existing theta sketches functions don't work for all types, for example there 
is no support for TIMESTAMP. To use Iceberg's puffin files to store incremental 
NDV stats, we need to support all types currently handled by COMPUTE STATS. The 
Iceberg spec defines clearly how to do this for all types:
https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Note that we may be unable to use nanosecond timestamps, probably an error 
could be returned in that case.


> Support missing types with theta sketches
> -----------------------------------------
>
>                 Key: IMPALA-15106
>                 URL: https://issues.apache.org/jira/browse/IMPALA-15106
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>              Labels: datasketches, impala-iceberg
>
> The existing theta sketch functions don't work for all types; for example, 
> there is no support for TIMESTAMP. To use Iceberg's Puffin files to store 
> incremental NDV stats, we need to support all types currently handled by 
> COMPUTE STATS. The Iceberg spec clearly defines how to do this for all types:
> https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type
> https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
> Note that we may be unable to support nanosecond timestamps; an error could 
> probably be returned in that case.
> In the frontend, theta functions are added here:
> https://github.com/apache/impala/blob/0b8294b1a5b0ad4a817dee13b7fbb2ee53f534e2/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java#L1171



