[
https://issues.apache.org/jira/browse/PARQUET-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434083#comment-15434083
]
Julien Le Dem commented on PARQUET-682:
---------------------------------------
In general, I think we have two use cases:
1) Users have specific knowledge of the data that lets them pick a better
encoding for a given column.
For this we want the override to be by column name rather than by type.
For example:
- the user knows that a field will not dictionary-encode well but will do well
with prefix coding. It saves time/memory not to fall back from dictionary
coding and to just use prefix coding right away.
- the user knows that a specific encoding will do better on a given column and
wants to try it first.
- the user wants to force dictionary encoding on a certain field (and fail if
it gets too big) for perf reasons.
2) Tweaking a general heuristic that picks a good encoding unsupervised.
Your suggestion (override by type) seems to apply to this case in particular.
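The two override scopes above can be combined by checking a per-column key
first and falling back to the per-type key. The following is a minimal sketch
under stated assumptions, not Parquet code: the per-type key prefix comes from
the issue description, while the per-column prefix, the class name, and the
plain Map standing in for the Hadoop configuration are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

public class EncodingOverrideResolver {
  // Key prefix proposed in the issue description (override by type).
  static final String TYPE_PREFIX = "parquet.writer.encoding-override.";
  // Hypothetical prefix for overrides by column name; not part of the proposal.
  static final String COLUMN_PREFIX = "parquet.writer.column-encoding-override.";

  /**
   * Returns the requested encodings as "primary[, fallback]", upper-cased,
   * or an empty list when no override is configured for this column.
   */
  static List<String> resolve(Map<String, String> conf, String columnPath, String typeName) {
    // A column-level override (use case 1) wins over a type-level one (use case 2).
    String value = conf.get(COLUMN_PREFIX + columnPath);
    if (value == null) {
      value = conf.get(TYPE_PREFIX + typeName.toLowerCase(Locale.ROOT));
    }
    if (value == null) {
      return Collections.emptyList();
    }
    List<String> encodings = Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .map(s -> s.toUpperCase(Locale.ROOT))
        .collect(Collectors.toList());
    // The proposed syntax allows at most a primary and one fallback encoding.
    if (encodings.size() > 2) {
      throw new IllegalArgumentException("Expected encoding1[,encoding2], got: " + value);
    }
    return encodings;
  }
}
```

With this ordering, a user who knows one column's data can pin its encoding
without disturbing the type-wide setting used for every other column.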
> Configure the encoding used by ValueWriters
> -------------------------------------------
>
> Key: PARQUET-682
> URL: https://issues.apache.org/jira/browse/PARQUET-682
> Project: Parquet
> Issue Type: Improvement
> Reporter: Piyush Narang
>
> This was supposed to be tackled by jira:
> https://issues.apache.org/jira/browse/PARQUET-601 but that ended up being
> just the work done to refactor the ValuesWriter factory code out of
> ParquetProperties. As that is now merged, it would be nice to revisit the
> original purpose: being able to configure which type of ValuesWriter is
> used for writing out columns.
> Background: Parquet is currently structured to choose the appropriate value
> writer based on the type of the column as well as the Parquet version. Value
> writers are responsible for writing out values with the appropriate encoding.
> As an example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0)
> or RunLengthBitPackingHybridValuesWriter (v2.0). Code to do this is in the
> [DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L31]
> and the
> [DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L35].
>
> It would be nice to support overriding the encodings in some way.
> That allows users to experiment with various encoding strategies manually as
> well as enables them to override the hardcoded defaults if they don't suit
> their use case.
> Couple of options I can think of:
> Specifying encoding by type (or column):
> {code}
> parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
> As an example:
> "parquet.writer.encoding-override.int32" = "plain"
> {code}
> Chooses Plain encoding and hence the PlainValuesWriter.
> When a primary + fallback need to be specified, we can do the following:
> {code}
> "parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
> {code}
> Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY
> encoding as the fallback and hence creates a
> FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter).
> In such cases we can mandate that the first encoding listed must allow for
> Fallbacks by implementing
> [RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
>
> Another option suggested by [~alexlevenson], was to allow overriding of the
> ValuesWriterFactory using reflection:
> {code}
> parquet.writer.factory-override =
> "org.apache.parquet.hadoop.MyValuesWriterFactory"
> {code}
> This creates a factory, MyValuesWriterFactory, which is then invoked for every
> ColumnDescriptor to get a ValuesWriter. This gives the user the flexibility
> to implement a ValuesWriterFactory that can read configuration for per
> type / column encoding overrides. It can also be used to plug in a more
> sophisticated approach where we choose the appropriate encoding based on the
> data being seen. A concern raised by [~rdblue] regarding this approach was
> that ValuesWriters are supposed to be internal classes in Parquet. So we
> shouldn't be allowing users to configure the ValuesWriter factories via
> config.
> cc [~julienledem] / [~rdblue] / [~alexlevenson] for your thoughts / other
> ideas. We could also explore other ideas based on any other potential use
> cases.
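The factory-override option in the description can be sketched as follows.
This is an illustration under assumptions, not the Parquet implementation:
WriterFactory is a hypothetical stand-in for ValuesWriterFactory (the real
interface differs), and a plain Map stands in for the Hadoop configuration.
Only the parquet.writer.factory-override key comes from the description.

```java
import java.util.Map;

public class FactoryOverrideLoader {
  /** Hypothetical stand-in for ValuesWriterFactory; the real interface differs. */
  public interface WriterFactory {
    String encodingFor(String columnPath);
  }

  /** Used when no override is configured. */
  public static class DefaultFactory implements WriterFactory {
    @Override
    public String encodingFor(String columnPath) {
      return "PLAIN";
    }
  }

  // Configuration key proposed in the issue description.
  static final String KEY = "parquet.writer.factory-override";

  /** Instantiates the configured factory class through its no-arg constructor. */
  static WriterFactory load(Map<String, String> conf) {
    String className = conf.get(KEY);
    if (className == null) {
      return new DefaultFactory();
    }
    try {
      Class<?> cls = Class.forName(className);
      // asSubclass rejects classes that do not implement the factory interface.
      return cls.asSubclass(WriterFactory.class).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalArgumentException("Cannot instantiate factory " + className, e);
    }
  }
}
```

Failing fast on a missing or incompatible class keeps a misconfigured job from
silently writing with the default encodings.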