[
https://issues.apache.org/jira/browse/PARQUET-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435916#comment-15435916
]
Piyush Narang commented on PARQUET-682:
---
Thanks [~julienledem], that makes sense. I'll file a couple of follow-up JIRAs
in that case to tackle these two scenarios. For 1, we could do something along
the lines of what I suggested above:
{code}
parquet.writer.encoding-override.my-column1="plain"
parquet.writer.encoding-override.my-column2="delta_byte_array"
{code}
That seems like the more straightforward scenario to start off with.
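For illustration, parsing such per-column overrides could be sketched roughly as follows (the helper class name and the use of a plain Map instead of a Hadoop Configuration are assumptions, not existing parquet-mr code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parser for per-column encoding overrides; the
// "parquet.writer.encoding-override." prefix follows the proposal above.
class EncodingOverrides {
    static final String PREFIX = "parquet.writer.encoding-override.";

    // Extracts column-name -> encoding-name pairs from raw config entries,
    // ignoring unrelated properties and normalizing encoding names.
    public static Map<String, String> parse(Map<String, String> conf) {
        Map<String, String> overrides = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                String column = e.getKey().substring(PREFIX.length());
                overrides.put(column, e.getValue().toLowerCase());
            }
        }
        return overrides;
    }
}
```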
I guess some of the confusion was regarding the approach for 2. It would be
nice if the heuristic-based algorithm were configurable. For something like
that, to ensure we don't use reflection explicitly, we would have to keep a
list of well-known overrides and map them to actual ValuesWriter factories:
{code}
parquet.writer.encoding-override.int32="auto-heuristic-A"
{code}
And somewhere in the code we map "auto-heuristic-A" to, say, an
AutoHeuristicValuesWriterFactory. Whenever we add a new heuristic, we update
the map / enum and allow users to try it out. As an initial example, we could
provide a heuristic / strategy that lets users override the encoding per
type. That would give us a simple example in the repo which can serve as a
template for more sophisticated auto-heuristic strategies in the future. (Open
to a simpler example / heuristic if you'd prefer.)
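A minimal sketch of such a name-to-factory registry (the ValuesWriterFactory interface here is a simplified stand-in for the real parquet-mr one, and the heuristic names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for org.apache.parquet.column.values.factory.ValuesWriterFactory.
interface ValuesWriterFactory {}

// Hypothetical heuristic-driven factory implementation.
class AutoHeuristicValuesWriterFactory implements ValuesWriterFactory {}

// Registry of well-known heuristic names, avoiding reflection: a new
// heuristic becomes available by adding one entry to the map.
class HeuristicRegistry {
    private static final Map<String, ValuesWriterFactory> HEURISTICS = new HashMap<>();
    static {
        HEURISTICS.put("auto-heuristic-A", new AutoHeuristicValuesWriterFactory());
    }

    public static ValuesWriterFactory lookup(String name) {
        ValuesWriterFactory f = HEURISTICS.get(name);
        if (f == null) {
            throw new IllegalArgumentException("Unknown heuristic: " + name);
        }
        return f;
    }
}
```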
> Configure the encoding used by ValueWriters
> ---
>
> Key: PARQUET-682
> URL: https://issues.apache.org/jira/browse/PARQUET-682
> Project: Parquet
> Issue Type: Improvement
> Reporter: Piyush Narang
>
> This was supposed to be tackled by JIRA
> https://issues.apache.org/jira/browse/PARQUET-601, but that ended up being
> just the work done to refactor the ValuesWriter factory code out of
> ParquetProperties. As that is now merged, it would be nice to revisit the
> original purpose: being able to configure which type of ValuesWriter is
> used for writing out columns.
> Background: Parquet is currently structured to choose the appropriate value
> writer based on the type of the column as well as the Parquet version. Value
> writers are responsible for writing out values with the appropriate encoding.
> As an example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0)
> or RunLengthBitPackingHybridValuesWriter (v2.0). Code to do this is in the
> [DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L31]
> and the
> [DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L35].
>
> It would be nice to support overriding the encodings in some way.
> That would allow users to experiment with various encoding strategies manually,
> as well as enable them to override the hardcoded defaults if those don't suit
> their use case.
> A couple of options I can think of:
> Specifying encoding by type (or column):
> {code}
> parquet.writer.encoding-override.<type-or-column> = "encoding1[,encoding2]"
> As an example:
> "parquet.writer.encoding-override.int32" = "plain"
> {code}
> Chooses Plain encoding and hence the PlainValuesWriter.
> When a primary + fallback need to be specified, we can do the following:
> {code}
> "parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
> {code}
> Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY
> encoding as the fallback, and hence creates a
> FallbackValuesWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter).
> In such cases we can mandate that the first encoding listed must allow for
> Fallbacks by implementing
> [RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
>
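As a rough illustration of the primary/fallback composition described above (all class names here are simplified stand-ins, not the actual parquet-mr writer classes):

```java
// Simplified stand-in for parquet-mr's ValuesWriter hierarchy.
interface SimpleValuesWriter {
    String encoding();
}

// Primary writer that can signal it should fall back, mirroring the
// RequiresFallback idea (e.g. when the dictionary grows too large).
class DictionaryWriter implements SimpleValuesWriter {
    boolean dictionaryTooLarge = false;
    public String encoding() { return "RLE_DICTIONARY"; }
}

class DeltaByteArrayWriterStub implements SimpleValuesWriter {
    public String encoding() { return "DELTA_BYTE_ARRAY"; }
}

// Delegates to the primary writer until it requests a fallback.
class FallbackWriterStub implements SimpleValuesWriter {
    private final DictionaryWriter primary;
    private final SimpleValuesWriter fallback;

    FallbackWriterStub(DictionaryWriter primary, SimpleValuesWriter fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    public String encoding() {
        return primary.dictionaryTooLarge ? fallback.encoding() : primary.encoding();
    }
}
```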
> Another option suggested by [~alexlevenson], was to allow overriding of the
> ValuesWriterFactory using reflection:
> {code}
> parquet.writer.factory-override =
> "org.apache.parquet.hadoop.MyValuesWriterFactory"
> {code}
> This creates a factory, MyValuesWriterFactory, which is then invoked for every
> ColumnDescriptor to get a ValuesWriter. This gives the user the flexibility
> to implement a ValuesWriterFactory that can read configuration for
> per-type / per-column encoding overrides. It can also be used to plug in a more
> sophisticated approach where we choose the appropriate encoding based on the
> data being seen. A concern raised by [~rdblue] regarding this approach was
> that ValuesWriters are supposed to be internal classes in Parquet, so we
> shouldn't be allowing users to