[ 
https://issues.apache.org/jira/browse/PARQUET-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Levenson resolved PARQUET-601.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: format-2.4.0

Issue resolved by pull request 342
[https://github.com/apache/parquet-mr/pull/342]

> Add support in Parquet to configure the encoding used by ValueWriters
> ---------------------------------------------------------------------
>
>                 Key: PARQUET-601
>                 URL: https://issues.apache.org/jira/browse/PARQUET-601
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Piyush Narang
>            Assignee: Piyush Narang
>             Fix For: format-2.4.0
>
>
> Parquet is currently structured to choose the appropriate value writer based 
> on the type of the column as well as the Parquet version. Value writers are 
> responsible for writing out values with the appropriate encoding. As an 
> example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) or 
> RunLengthBitPackingHybridValuesWriter (v2.0). The code that makes these 
> decisions is in 
> [ParquetProperties|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L242].
>  
> Because of this setup, the writer (and hence the encoding) for each data type 
> is hard-coded in the Parquet source code. 
> It would be nice to support overriding the encodings per type via config. 
> That would let users experiment with various encoding strategies manually and 
> override the hard-coded defaults if they don't suit their use case. 
> We can override encodings per data type (int32 / int64 / ...). 
> Something along the lines of:
> {code}
> parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
> {code}
> As an example:
> {code}
> "parquet.writer.encoding-override.int32" = "plain"
> (Chooses Plain encoding and hence the PlainValuesWriter).
> {code}
> When a primary + fallback need to be specified, we can do the following:
> {code}
> "parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
> (Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY 
> encoding as the fallback and hence creates a 
> FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter)).
> {code}
> In such cases we can require that the writer for the first encoding listed 
> supports fallback by implementing 
> [RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
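
The property format proposed above could be parsed roughly as follows. This is only a sketch under stated assumptions, not the code from the pull request: the class EncodingOverride and its parse method are hypothetical, the Encoding enum is redeclared locally (with names mirroring Parquet's encodings) so the example stands alone, and the rule that only dictionary encodings may act as a primary with a fallback is an assumption standing in for the RequiresFallback check.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of parquet-mr): parses "encoding1[,encoding2]". */
final class EncodingOverride {

    // Local stand-in for Parquet's encoding names, redeclared to keep the sketch self-contained.
    enum Encoding {
        PLAIN, RLE, BIT_PACKED, PLAIN_DICTIONARY, RLE_DICTIONARY,
        DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY
    }

    // Assumption: only dictionary encodings have writers that support fallback.
    private static final Set<Encoding> SUPPORTS_FALLBACK =
            EnumSet.of(Encoding.PLAIN_DICTIONARY, Encoding.RLE_DICTIONARY);

    final Encoding primary;
    final Encoding fallback; // null when no fallback was configured

    private EncodingOverride(Encoding primary, Encoding fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    /** Parses a property value such as "plain" or "rle_dictionary,delta_byte_array". */
    static EncodingOverride parse(String value) {
        String[] parts = value.trim().split("\\s*,\\s*");
        if (parts.length > 2 || parts[0].isEmpty()) {
            throw new IllegalArgumentException(
                    "Expected \"encoding1[,encoding2]\" but got: " + value);
        }
        // valueOf throws IllegalArgumentException for unknown encoding names.
        Encoding primary = Encoding.valueOf(parts[0].toUpperCase(Locale.ROOT));
        if (parts.length == 1) {
            return new EncodingOverride(primary, null);
        }
        Encoding fallback = Encoding.valueOf(parts[1].toUpperCase(Locale.ROOT));
        // Mirrors the proposed rule: the first encoding must allow for fallback.
        if (!SUPPORTS_FALLBACK.contains(primary)) {
            throw new IllegalArgumentException(
                    primary + " cannot be combined with a fallback encoding");
        }
        return new EncodingOverride(primary, fallback);
    }
}
```

With this sketch, parse("plain") yields a primary of PLAIN with no fallback, while parse("plain,delta_byte_array") is rejected because PLAIN is not in the assumed set of encodings that support fallback. Mapping the parsed encodings to concrete ValuesWriter instances is left out, since that depends on the column type and writer configuration.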



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
