[ https://issues.apache.org/jira/browse/PARQUET-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Levenson resolved PARQUET-601.
-----------------------------------
Resolution: Fixed
Fix Version/s: format-2.4.0
Issue resolved by pull request 342
[https://github.com/apache/parquet-mr/pull/342]
> Add support in Parquet to configure the encoding used by ValueWriters
> ---------------------------------------------------------------------
>
> Key: PARQUET-601
> URL: https://issues.apache.org/jira/browse/PARQUET-601
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Piyush Narang
> Assignee: Piyush Narang
> Fix For: format-2.4.0
>
>
> Parquet is currently structured to choose the appropriate value writer based
> on the type of the column as well as the Parquet version. Value writers are
> responsible for writing out values with the appropriate encoding. As an
> example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) or
> RunLengthBitPackingHybridValuesWriter (v2.0). The code that makes these
> decisions is in
> [ParquetProperties|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L242].
>
> Because of this setup, the writers (and hence the encodings) for each data
> type are hard-coded in the Parquet source code.
> It would be nice to support overriding the encodings per type via
> config. That would let users experiment with various encoding strategies
> manually and override the hard-coded defaults if they
> don't suit their use case.
> We can override encodings per data type (int32 / int64 / ...).
> Something along the lines of:
> {code}
> parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
> {code}
> As an example:
> {code}
> "parquet.writer.encoding-override.int32" = "plain"
> (Chooses Plain encoding and hence the PlainValuesWriter).
> {code}
> When a primary + fallback need to be specified, we can do the following:
> {code}
> "parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
> (Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY
> encoding as the fallback and hence creates a
> FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter)).
> {code}
> In such cases we can require that the writer for the first encoding listed
> supports fallbacks by implementing
> [RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
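
The override format proposed above could be parsed roughly as follows. This is only a sketch under stated assumptions: the class EncodingOverrideSketch, its Enc enum and the SUPPORTS_FALLBACK set are hypothetical names used for illustration, not parquet-mr APIs. The rule that only dictionary encodings may be listed first in a primary + fallback pair is likewise an assumption standing in for the real check against RequiresFallback.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of parquet-mr: parses the value of a
// "parquet.writer.encoding-override.<type>" property into one or two encodings.
final class EncodingOverrideSketch {

    // Stand-in enum for illustration; the names mirror Parquet encoding names.
    enum Enc { PLAIN, RLE, PLAIN_DICTIONARY, RLE_DICTIONARY, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY }

    // Assumption: only dictionary encodings have writers that support a fallback.
    static final Set<Enc> SUPPORTS_FALLBACK = EnumSet.of(Enc.PLAIN_DICTIONARY, Enc.RLE_DICTIONARY);

    static final String PREFIX = "parquet.writer.encoding-override.";

    // Builds the property key for a primitive type name such as "int32" or "binary".
    static String keyFor(String typeName) {
        return PREFIX + typeName.toLowerCase(Locale.ROOT);
    }

    // Parses "encoding1[,encoding2]" into a list of [primary] or [primary, fallback].
    static List<Enc> parse(String value) {
        // Limit -1 keeps trailing empty tokens so "plain," is rejected below.
        String[] parts = value.trim().split("\\s*,\\s*", -1);
        if (parts.length > 2) {
            throw new IllegalArgumentException("At most two encodings allowed: " + value);
        }
        List<Enc> result = new ArrayList<>();
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty encoding name in: " + value);
            }
            // Enum.valueOf throws IllegalArgumentException for unknown names.
            result.add(Enc.valueOf(part.toUpperCase(Locale.ROOT)));
        }
        // When a fallback is given, the primary must be able to fall back.
        if (result.size() == 2 && !SUPPORTS_FALLBACK.contains(result.get(0))) {
            throw new IllegalArgumentException(
                result.get(0) + " cannot be used as a primary encoding with a fallback");
        }
        return result;
    }
}
```

With this sketch, parse("plain") yields a single primary encoding, parse("rle_dictionary,delta_byte_array") yields a primary and a fallback, and parse("plain,delta_byte_array") is rejected because PLAIN is not in the assumed fallback-capable set.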