Piyush Narang created PARQUET-682:
-------------------------------------
Summary: Configure the encoding used by ValueWriters
Key: PARQUET-682
URL: https://issues.apache.org/jira/browse/PARQUET-682
Project: Parquet
Issue Type: Improvement
Reporter: Piyush Narang
This was originally meant to be tackled by
https://issues.apache.org/jira/browse/PARQUET-601, but that issue ended up
covering only the refactoring of the ValuesWriter factory code out of
ParquetProperties. Now that it is merged, it would be nice to revisit the
original goal: being able to configure which type of ValuesWriters are used
for writing out columns.
Background: Parquet is currently structured to choose the appropriate value
writer based on the type of the column as well as the Parquet version. Value
writers are responsible for writing out values with the appropriate encoding.
As an example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0)
or RunLengthBitPackingHybridValuesWriter (v2.0). Code to do this is in the
[DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L31]
and the
[DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L35].
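To make the background concrete, here is a simplified sketch (not the actual parquet-mr code; all names except the writer class names linked above are stand-ins) of how a factory's choice of writer depends on both the column's type and the writer version:

```java
// Simplified sketch of version-dependent writer selection.
// ValuesWriter here is a stand-in for parquet-mr's class of the same name.
public class WriterSelectionSketch {
    enum PrimitiveType { BOOLEAN, INT32, BINARY }
    enum WriterVersion { V1, V2 }

    interface ValuesWriter {
        String writerName();
    }

    static ValuesWriter newWriter(String name) {
        return () -> name;
    }

    // Mirrors the idea behind DefaultV1/V2ValuesWriterFactory: the writer
    // (and hence the encoding) is hardcoded per type and per version.
    static ValuesWriter writerFor(PrimitiveType type, WriterVersion version) {
        switch (type) {
            case BOOLEAN:
                return version == WriterVersion.V1
                        ? newWriter("BooleanPlainValuesWriter")
                        : newWriter("RunLengthBitPackingHybridValuesWriter");
            default:
                // Other types have their own per-version choices; elided here.
                return newWriter("PlainValuesWriter");
        }
    }

    public static void main(String[] args) {
        System.out.println(
            writerFor(PrimitiveType.BOOLEAN, WriterVersion.V1).writerName());
        System.out.println(
            writerFor(PrimitiveType.BOOLEAN, WriterVersion.V2).writerName());
    }
}
```

The point of this issue is that the `switch` above is fixed at compile time; users have no hook to change it.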
It would be nice to support overriding the encodings in some way. That would
let users experiment with various encoding strategies manually, and also
override the hardcoded defaults when those don't suit their use case.
A couple of options I can think of:
Specifying encoding by type (or column):
{code}
parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
As an example:
"parquet.writer.encoding-override.int32" = "plain"
{code}
Chooses Plain encoding and hence the PlainValuesWriter.
When a primary + fallback need to be specified, we can do the following:
{code}
"parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
{code}
Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY
encoding as the fallback, and hence creates a
FallbackValuesWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter).
In such cases we can mandate that the first encoding listed must allow for
Fallbacks by implementing
[RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
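Parsing the proposed property value is straightforward; here is a hypothetical sketch (nothing below exists in parquet-mr today; the "encoding1[,encoding2]" shape is taken from this proposal):

```java
// Hypothetical parser for the proposed
// "parquet.writer.encoding-override.<type>" property value, where the
// optional second entry names the fallback encoding.
public class EncodingOverrideSketch {
    static final class EncodingOverride {
        final String primary;
        final String fallback; // null when no fallback was listed

        EncodingOverride(String primary, String fallback) {
            this.primary = primary;
            this.fallback = fallback;
        }
    }

    // Split "encoding1[,encoding2]" into primary and optional fallback.
    static EncodingOverride parse(String value) {
        String[] parts = value.split(",", 2);
        return new EncodingOverride(
                parts[0].trim(),
                parts.length > 1 ? parts[1].trim() : null);
    }

    public static void main(String[] args) {
        EncodingOverride o = parse("rle_dictionary,delta_byte_array");
        System.out.println(o.primary + " / " + o.fallback);
    }
}
```

A real implementation would additionally validate that, when a fallback is present, the primary encoding's writer implements RequiresFallback.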
Another option, suggested by [~alexlevenson], was to allow overriding the
ValuesWriterFactory itself using reflection:
{code}
parquet.writer.factory-override =
"org.apache.parquet.hadoop.MyValuesWriterFactory"
{code}
This creates the named factory, MyValuesWriterFactory, which is then invoked
for every ColumnDescriptor to get a ValuesWriter. This gives users the
flexibility to implement a ValuesWriterFactory that reads configuration for
per-type / per-column encoding overrides. It can also be used to plug in a
more sophisticated approach where the appropriate encoding is chosen based on
the data being seen. A concern raised by [~rdblue] regarding this approach was
that ValuesWriters are supposed to be internal classes in Parquet, so we
shouldn't allow users to configure the ValuesWriter factories via config.
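Mechanically, the reflection path would look something like this sketch (the ValuesWriterFactory interface and loadFactory method below are stand-ins; only the idea of naming a class in config comes from this proposal):

```java
// Sketch of the reflection-based override: read a fully qualified class
// name (e.g. from "parquet.writer.factory-override") and instantiate it
// via its no-arg constructor.
public class FactoryOverrideSketch {
    interface ValuesWriterFactory {
        // In parquet-mr this would hand back a ValuesWriter per ColumnDescriptor.
    }

    // Example user-supplied factory that the config property could name.
    public static class MyValuesWriterFactory implements ValuesWriterFactory { }

    static ValuesWriterFactory loadFactory(String className) {
        try {
            return (ValuesWriterFactory) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot instantiate ValuesWriterFactory: " + className, e);
        }
    }

    public static void main(String[] args) {
        ValuesWriterFactory f =
                loadFactory("FactoryOverrideSketch$MyValuesWriterFactory");
        System.out.println(f.getClass().getSimpleName());
    }
}
```

Note this is exactly the surface [~rdblue]'s concern applies to: once a factory class name is accepted via config, the ValuesWriter types it must produce effectively become public API.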
cc [~julienledem] / [~rdblue] / [~alexlevenson] for your thoughts / other ideas.
We could also explore other ideas based on any other potential use cases.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)