Csaba Ringhofer created IMPALA-9578:
---------------------------------------

             Summary: Read/write support for BINARY in Parquet
                 Key: IMPALA-9578
                 URL: https://issues.apache.org/jira/browse/IMPALA-9578
             Project: IMPALA
          Issue Type: Sub-task
            Reporter: Csaba Ringhofer


In Parquet both STRING and BINARY are stored using the same physical type, 
BYTE_ARRAY.

There is a String annotation among the logical types 
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string), 
which means UTF-8 encoding (and is of course ignored by Impala).
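
To illustrate why the annotation matters even though the physical type is the same, here is a minimal Python sketch. It is not Impala or Parquet library code. The helper name is_valid_utf8 and the sample values are assumptions made up for illustration. It only shows that the same kind of byte array may or may not be a valid UTF-8 string, which is the property the String annotation promises and plain BINARY does not.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Both values could be stored as BYTE_ARRAY; only the first one would be
# a legal value under a UTF-8 String annotation.
byte_array_values = [
    b"caf\xc3\xa9",       # UTF-8 encoding of "café"
    b"\xff\xfe\x00\x01",  # arbitrary binary data, not valid UTF-8
]

for raw in byte_array_values:
    print(raw, "valid UTF-8" if is_valid_utf8(raw) else "not valid UTF-8")
```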

Both reading and writing should occur the same way as with STRING.

There is one potential difference to consider during writing: in ORC, 
BinaryStatistics has no min/max stats (StringStatistics has them). My guess for 
the reason is that binary values are often very large and "random", so the 
stats are likely to need a lot of space while never being used successfully 
for filtering. Note that Parquet is a bit different with its per-page 
statistics and can potentially need even more space for stats.
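
The space concern above can be sketched with a small Python simulation. This is only an illustration under assumptions, not how Impala or any Parquet writer actually computes statistics: the page count, values per page, and value length are hypothetical, the values are uniformly random bytes, and min/max are stored untruncated.

```python
import random

VALUE_LEN = 256        # hypothetical size of each binary value, in bytes
VALUES_PER_PAGE = 100  # hypothetical number of values per page
NUM_PAGES = 10         # hypothetical number of pages


def random_value(rng: random.Random, length: int) -> bytes:
    """Generate one 'random' binary value of the given length."""
    return bytes(rng.getrandbits(8) for _ in range(length))


def min_max_stats(values):
    """Untruncated min/max of a page, compared lexicographically as bytes."""
    return min(values), max(values)


def can_skip(stats, probe: bytes) -> bool:
    """True if an equality predicate on probe cannot match within [min, max]."""
    lo, hi = stats
    return probe < lo or probe > hi


rng = random.Random(0)  # fixed seed so the run is repeatable
pages = [
    [random_value(rng, VALUE_LEN) for _ in range(VALUES_PER_PAGE)]
    for _ in range(NUM_PAGES)
]
page_stats = [min_max_stats(page) for page in pages]

# Each page stores two full-length values when stats are not truncated.
stats_bytes = sum(len(lo) + len(hi) for lo, hi in page_stats)

# With random data, every page's [min, max] covers almost the whole value
# space, so an arbitrary probe value rarely falls outside it.
probe = random_value(rng, VALUE_LEN)
skipped = sum(can_skip(stats, probe) for stats in page_stats)

print(f"stats size: {stats_bytes} bytes for {NUM_PAGES} pages")
print(f"pages skipped for a random equality probe: {skipped} of {NUM_PAGES}")
```

In this setup the stats cost 2 * VALUE_LEN bytes per page, so the overhead grows with both the value size and the number of pages, which is why per-page statistics could be more expensive than per-stripe or per-file ones.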



--
This message was sent by Atlassian Jira
(v8.3.4#803005)