Piyush Narang created PARQUET-682:
-------------------------------------

             Summary: Configure the encoding used by ValueWriters
                 Key: PARQUET-682
                 URL: https://issues.apache.org/jira/browse/PARQUET-682
             Project: Parquet
          Issue Type: Improvement
            Reporter: Piyush Narang


This was originally meant to be tackled by 
https://issues.apache.org/jira/browse/PARQUET-601, but that issue ended up 
covering only the refactoring of the ValuesWriter factory code out of 
ParquetProperties. Now that the refactoring is merged, it would be nice to 
revisit the original goal: being able to configure which type of ValuesWriter 
is used for writing out each column.

Background: Parquet is currently structured to choose the appropriate value 
writer based on the type of the column as well as the Parquet version. Value 
writers are responsible for writing out values with the appropriate encoding. 
As an example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) 
or RunLengthBitPackingHybridValuesWriter (v2.0). Code to do this is in the 
[DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L31]
 and the 
[DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L35].
 

It would be nice to support overriding these encodings in some way. That would 
allow users to experiment with various encoding strategies manually, as well 
as override the hardcoded defaults when they don't suit their use case.

A couple of options I can think of:

Specifying encoding by type (or column):
{code}
parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
As an example:
"parquet.writer.encoding-override.int32" = "plain"
{code}
This chooses PLAIN encoding, and hence the PlainValuesWriter, for int32 columns.

When a primary and a fallback encoding need to be specified, we can do the following:
{code}
"parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
{code}

This chooses RLE_DICTIONARY as the initial encoding and DELTA_BYTE_ARRAY as 
the fallback, and hence creates a 
FallbackValuesWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter). 
In such cases we can mandate that the writer for the first encoding listed 
must support falling back by implementing 
[RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
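The fallback mechanism above can be sketched in a few lines. This is a simplified, self-contained illustration: the classes below are stand-ins for parquet-mr's FallbackValuesWriter, RequiresFallback, and the dictionary/plain writers, not the real implementations (which buffer bytes and migrate already-written values on fallback).

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in for parquet's ValuesWriter: accepts values, reports its encoding.
interface ValuesWriter {
    void writeBytes(String value);
    String encodingName();
}

// Mirrors parquet's RequiresFallback: a writer that can signal that its
// encoding is no longer worthwhile.
interface RequiresFallback {
    boolean shouldFallBack();
}

// Toy dictionary writer: wants to fall back once the dictionary grows too big.
class DictionaryWriter implements ValuesWriter, RequiresFallback {
    private final Set<String> dictionary = new HashSet<>();
    private final int maxDictionarySize;

    DictionaryWriter(int maxDictionarySize) { this.maxDictionarySize = maxDictionarySize; }

    public void writeBytes(String value) { dictionary.add(value); }
    public String encodingName() { return "rle_dictionary"; }
    public boolean shouldFallBack() { return dictionary.size() > maxDictionarySize; }
}

class PlainWriter implements ValuesWriter {
    public void writeBytes(String value) { /* write the raw value */ }
    public String encodingName() { return "plain"; }
}

// Writes with the initial writer until it asks to fall back, then switches
// to the fallback writer for the rest of the column chunk.
class FallbackValuesWriter implements ValuesWriter {
    private final ValuesWriter initial;
    private final ValuesWriter fallback;
    private boolean fellBack = false;

    // The initial writer must implement RequiresFallback, per the proposal.
    <T extends ValuesWriter & RequiresFallback> FallbackValuesWriter(T initial, ValuesWriter fallback) {
        this.initial = initial;
        this.fallback = fallback;
    }

    public void writeBytes(String value) {
        if (!fellBack && ((RequiresFallback) initial).shouldFallBack()) {
            fellBack = true; // the real writer also migrates already-buffered values
        }
        (fellBack ? fallback : initial).writeBytes(value);
    }

    public String encodingName() {
        return (fellBack ? fallback : initial).encodingName();
    }
}
```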
 

Another option, suggested by [~alexlevenson], is to allow overriding the 
ValuesWriterFactory using reflection:
{code}
parquet.writer.factory-override = 
"org.apache.parquet.hadoop.MyValuesWriterFactory"
{code}

This instantiates the named factory, MyValuesWriterFactory, which is then 
invoked for every ColumnDescriptor to get a ValuesWriter. This gives users the 
flexibility to implement a ValuesWriterFactory that reads configuration for 
per-type / per-column encoding overrides. It could also be used to plug in a 
more sophisticated approach that chooses the appropriate encoding based on the 
data being seen. A concern raised by [~rdblue] about this approach is that 
ValuesWriters are supposed to be internal Parquet classes, so we shouldn't 
allow users to configure the ValuesWriter factories via config.
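A rough, self-contained sketch of how the reflection-based override could work. SimpleValuesWriterFactory and FactoryLoader are hypothetical stand-ins: the real factory would take a ColumnDescriptor and return a ValuesWriter, and the config would come from a Hadoop Configuration rather than a Map.

```java
import java.util.Map;

// Stand-in for parquet's ValuesWriterFactory; here it just reports the
// encoding it would pick for a column of the given type.
interface SimpleValuesWriterFactory {
    String encodingFor(String columnType);
}

// A hypothetical user-supplied factory applying its own per-type policy.
class MyValuesWriterFactory implements SimpleValuesWriterFactory {
    public String encodingFor(String columnType) {
        return columnType.equals("binary") ? "rle_dictionary" : "plain";
    }
}

class FactoryLoader {
    // Mirrors how the writer could instantiate the configured factory class
    // named under the proposed "parquet.writer.factory-override" key.
    static SimpleValuesWriterFactory load(Map<String, String> conf) {
        String className = conf.getOrDefault(
            "parquet.writer.factory-override", MyValuesWriterFactory.class.getName());
        try {
            return (SimpleValuesWriterFactory)
                Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot instantiate " + className, e);
        }
    }
}
```

The factory would then be consulted once per ColumnDescriptor, so a single implementation can cover per-type, per-column, or data-dependent strategies.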

cc [~julienledem] / [~rdblue] / [~alexlevenson] for your thoughts / other ideas. 
We could also explore other options based on any additional use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
