[
https://issues.apache.org/jira/browse/IMPALA-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Csaba Ringhofer updated IMPALA-9578:
------------------------------------
Labels: parquet (was: )
> Read/write support for BINARY in Parquet
> ----------------------------------------
>
> Key: IMPALA-9578
> URL: https://issues.apache.org/jira/browse/IMPALA-9578
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Reporter: Csaba Ringhofer
> Priority: Major
> Labels: parquet
>
> In Parquet both STRING and BINARY are stored using the same physical type,
> BYTE_ARRAY.
> There is a String annotation among logical types
> (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string),
> which means UTF-8 encoding (and is of course ignored by Impala).
> Both reading and writing should occur the same way as with STRING.
> There is one potential difference to consider during writing: in ORC
> BinaryStatistics has no min/max stats (StringStatistics has them). My guess
> for the reason is that binary values are often very large and "random", so it
> is likely for the stats to need a lot of space while never being used
> successfully for filtering. Note that Parquet is a bit different: with its
> per-page statistics it can potentially need even more space for stats.
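
The two points in the description can be illustrated with a small sketch. It is not Impala code: the helper names `plain_encode_byte_array` and `min_max_stats_bytes` and the page sizes are hypothetical. It assumes Parquet's PLAIN encoding for BYTE_ARRAY (a 4-byte little-endian length followed by the raw bytes) and that min/max values are kept untruncated for every page.

```python
import os
import struct


def plain_encode_byte_array(value):
    """PLAIN-encode one BYTE_ARRAY value: 4-byte little-endian length, then the bytes."""
    # A STRING is just its UTF-8 bytes; a BINARY value is used as-is.
    raw = value.encode("utf-8") if isinstance(value, str) else bytes(value)
    return struct.pack("<I", len(raw)) + raw


def min_max_stats_bytes(pages):
    """Bytes needed to keep an untruncated min and max value for every page."""
    # Python compares bytes objects lexicographically by unsigned byte value.
    return sum(len(min(page)) + len(max(page)) for page in pages)


# STRING and BINARY values holding the same bytes encode identically.
assert plain_encode_byte_array("abc") == plain_encode_byte_array(b"abc")

# Hypothetical workload: 10 pages, each with 100 random 1 KiB blobs.
pages = [[os.urandom(1024) for _ in range(100)] for _ in range(10)]
stats_size = min_max_stats_bytes(pages)  # two 1 KiB values per page
data_size = sum(len(v) for page in pages for v in page)
print(f"stats: {stats_size} bytes for {data_size} bytes of data")
```

With random values the min and max of a page are full-size blobs that are unlikely to help any filter, so the cost grows with the number of pages. Whether a writer should truncate or skip these stats for BINARY is the open question raised above.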
--
This message was sent by Atlassian Jira
(v8.3.4#803005)