[ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839660#comment-15839660 ]
Uwe L. Korn commented on PARQUET-845:
-------------------------------------
Storage-wise, it should not make a difference whether you have an INT8 or
an INT32 physical type. Putting four INT8s into a single INT32 would actually
decrease Parquet's efficiency, as some of the encoding "tricks" aren't as
effective anymore. (Usually my INT8 columns take less than a bit per row when
stored in Parquet.)
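
To illustrate why packing can hurt, here is a rough sketch in plain Python. It
does not use Parquet itself; it only models a dictionary-style encoding, where
each value is stored as an index into a table of distinct values. The cost
model (4 bytes per dictionary entry, indices bit-packed at the minimum width),
the synthetic column, and the helper name dictionary_size_estimate are all
assumptions made for the illustration, not Parquet's actual on-disk layout.

```python
import random


def dictionary_size_estimate(values, bytes_per_entry=4):
    """Return (distinct count, rough size in bytes) under a simplified
    dictionary-encoding model. This is an illustration, not Parquet's format."""
    distinct = len(set(values))
    # Minimum number of bits needed to address every dictionary entry.
    bit_width = (distinct - 1).bit_length()
    index_bytes = (len(values) * bit_width + 7) // 8
    dictionary_bytes = distinct * bytes_per_entry
    return distinct, index_bytes + dictionary_bytes


# Hypothetical column: 100,000 small integers drawn from 16 possible values.
rng = random.Random(0)
int8_column = [rng.randrange(16) for _ in range(100_000)]

# Pack every four consecutive values into one 32-bit integer.
packed_column = [
    int.from_bytes(bytes(int8_column[i:i + 4]), "little")
    for i in range(0, len(int8_column), 4)
]

small_distinct, small_size = dictionary_size_estimate(int8_column)
packed_distinct, packed_size = dictionary_size_estimate(packed_column)

print("unpacked:", small_distinct, "distinct values,", small_size, "bytes")
print("packed:  ", packed_distinct, "distinct values,", packed_size, "bytes")
```

In this model the unpacked column needs only a 16-entry dictionary, while the
packed column sees thousands of distinct combinations, so its dictionary alone
outweighs whatever is saved by storing fewer indices.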
Or are you maybe talking about a particular API that should return INT8s
instead of INT32s?
> Efficient storage for several INT_8 and INT_16
> ----------------------------------------------
>
> Key: PARQUET-845
> URL: https://issues.apache.org/jira/browse/PARQUET-845
> Project: Parquet
> Issue Type: Wish
> Reporter: Fernando Pereira
> Priority: Minor
>
> In very large datasets, aggregating several INT8 into INT32 fields (or byte
> array) can make a big difference.
> In Parquet, efficient algorithms exist for INT32, so if the LogicalType is
> INT_8 the encoded int might take up only one byte.
> However further optimizations could be made by allowing the user to better
> specify the types.
> What about BYTE_ARRAY logical type, backed by FIXED_LEN_BYTE_ARRAY type (or
> eventually INT_32)?