[ 
https://issues.apache.org/jira/browse/PARQUET-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piyush Narang updated PARQUET-601:
----------------------------------
    Description: 
Parquet is currently structured to choose the appropriate value writer based on 
the type of the column as well as the Parquet version. Value writers are 
responsible for writing out values with the appropriate encoding. As an 
example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) or 
RunLengthBitPackingHybridValuesWriter (v2.0). The code that makes these decisions 
is in 
[ParquetProperties|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L242].
 

Because of this setup, the writer(s) (and hence the encoding) for each data type 
are hard-coded in the Parquet source code. 

It would be nice to support overriding the encodings per type via 
config. That would let users experiment with various encoding strategies 
manually and override the hard-coded defaults if they 
don't suit their use case. 

We can override encodings per data type (int32 / int64 / ...). 
Something along the lines of:
{code}
parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
{code}

As an example:
{code}
"parquet.writer.encoding-override.int32" = "plain"
(Chooses Plain encoding and hence the PlainValuesWriter).
{code}

When both a primary and a fallback encoding need to be specified, we can do the following:
{code}
"parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
(Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY 
encoding as the fallback, and hence creates a 
FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter).)
{code}

In such cases we can require that the writer for the first encoding listed 
supports fallbacks by implementing 
[RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
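
The proposed value format could be parsed and validated roughly as follows. This is a minimal, self-contained sketch and not parquet-mr code: the EncodingOverride class, its local Enc enum, and the SUPPORTS_FALLBACK set are hypothetical stand-ins. In particular, the set assumes that only the dictionary encodings are backed by writers that implement RequiresFallback.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical parser for values of parquet.writer.encoding-override.<type>. */
final class EncodingOverride {

    /** Local stand-in for Parquet's encoding names; not the real Encoding enum. */
    enum Enc {
        PLAIN, RLE, BIT_PACKED, PLAIN_DICTIONARY, RLE_DICTIONARY,
        DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY
    }

    // Assumption: stands in for "the primary writer implements RequiresFallback".
    private static final Set<Enc> SUPPORTS_FALLBACK =
        EnumSet.of(Enc.PLAIN_DICTIONARY, Enc.RLE_DICTIONARY);

    final Enc primary;
    final Enc fallback; // null when no fallback encoding was listed

    private EncodingOverride(Enc primary, Enc fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    /** Parses "encoding1[,encoding2]" into a primary and an optional fallback. */
    static EncodingOverride parse(String value) {
        // Limit -1 keeps trailing empty parts, so "plain," is rejected below.
        String[] parts = value.trim().split("\\s*,\\s*", -1);
        if (parts.length > 2) {
            throw new IllegalArgumentException(
                "Expected encoding1[,encoding2] but got: " + value);
        }
        Enc primary = toEnc(parts[0]);
        if (parts.length == 1) {
            return new EncodingOverride(primary, null);
        }
        if (!SUPPORTS_FALLBACK.contains(primary)) {
            throw new IllegalArgumentException(
                primary + " cannot be combined with a fallback encoding");
        }
        return new EncodingOverride(primary, toEnc(parts[1]));
    }

    // Enum.valueOf throws IllegalArgumentException for unknown or empty names.
    private static Enc toEnc(String name) {
        return Enc.valueOf(name.toUpperCase(Locale.ROOT));
    }
}
```

With this sketch, "plain" yields a primary of PLAIN with no fallback, "rle_dictionary,delta_byte_array" yields RLE_DICTIONARY with DELTA_BYTE_ARRAY as the fallback, and "plain,delta_byte_array" is rejected because PLAIN is not in the assumed fallback-capable set.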

  was:
Parquet is currently structured to choose the appropriate value writer based on 
the type of the column as well as the Parquet version. Value writers are 
responsible for writing out values with the appropriate encoding. As an 
example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) or 
RunLengthBitPackingHybridValuesWriter (v2.0). The code that makes these decisions 
is in 
ParquetProperties(https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L242).
 

Because of this setup, the writer(s) (and hence the encoding) for each data type 
are hard-coded in the Parquet source code. 

It would be nice to support overriding the encodings per type via 
config. That would let users experiment with various encoding strategies 
manually and override the hard-coded defaults if they 
don't suit their use case. 

We can override encodings per data type (int32 / int64 / ...). 
Something along the lines of:
parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
 
As an example:
"parquet.writer.encoding-override.int32" = "plain"
(Chooses Plain encoding and hence the PlainValuesWriter).

When both a primary and a fallback encoding need to be specified, we can do the following:
"parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
(Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY 
encoding as the fallback, and hence creates a 
FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter).)

In such cases we can require that the writer for the first encoding listed 
supports fallbacks by implementing 
RequiresFallback(https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31).


> Add support in Parquet to configure the encoding used by ValueWriters
> ---------------------------------------------------------------------
>
>                 Key: PARQUET-601
>                 URL: https://issues.apache.org/jira/browse/PARQUET-601
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Piyush Narang
>            Assignee: Piyush Narang
>
> Parquet is currently structured to choose the appropriate value writer based 
> on the type of the column as well as the Parquet version. Value writers are 
> responsible for writing out values with the appropriate encoding. As an 
> example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) or 
> RunLengthBitPackingHybridValuesWriter (v2.0). The code that makes these 
> decisions is in 
> [ParquetProperties|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L242].
>  
> Because of this setup, the writer(s) (and hence the encoding) for each data type 
> are hard-coded in the Parquet source code. 
> It would be nice to support overriding the encodings per type via 
> config. That would let users experiment with various encoding strategies 
> manually and override the hard-coded defaults if they 
> don't suit their use case. 
> We can override encodings per data type (int32 / int64 / ...). 
> Something along the lines of:
> {code}
> parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
> {code}
> As an example:
> {code}
> "parquet.writer.encoding-override.int32" = "plain"
> (Chooses Plain encoding and hence the PlainValuesWriter).
> {code}
> When both a primary and a fallback encoding need to be specified, we can do the following:
> {code}
> "parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
> (Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY 
> encoding as the fallback, and hence creates a 
> FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter).)
> {code}
> In such cases we can require that the writer for the first encoding listed 
> supports fallbacks by implementing 
> [RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
