[ https://issues.apache.org/jira/browse/PARQUET-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434083#comment-15434083 ]
Julien Le Dem commented on PARQUET-682: --------------------------------------- I general I think we have 2 use cases: 1) The users have specific knowledge of the data that make them pick a better encoding for a given column. For this we want the override to be by column name rather than type. Because for example: - the user knows that a field will not get dictionary encoded but will perform in prefix coding. it will save time/memory not to fallback from dic coding and just do prefix coding right away. - the user knows that a specific encoding will do better on a given column and wants to try it first. - the user wants to force dictionary encoding on a certain field (and fail if it gets too big) for perf reasons. 2) Tweaking a general heuristic to pick a good encoding unsupervised. your suggestion seems to apply to this in particular. (override by type) > Configure the encoding used by ValueWriters > ------------------------------------------- > > Key: PARQUET-682 > URL: https://issues.apache.org/jira/browse/PARQUET-682 > Project: Parquet > Issue Type: Improvement > Reporter: Piyush Narang > > This was supposed to be tackled by jira: > https://issues.apache.org/jira/browse/PARQUET-601 but that ended up being > just the work done to refactor the ValuesWriter factory code out of > ParquetProperties. As that is now merged, it would be nice to revisit the > original purpose - being able to configure which type of ValuesWriters to be > used for writing out columns. > Background: Parquet is currently structured to choose the appropriate value > writer based on the type of the column as well as the Parquet version. Value > writers are responsible for writing out values with the appropriate encoding. > As an example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) > or RunLengthBitPackingHybridValuesWriter (v2.0). Code to do this is in the > [DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L31] > and the > [DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L35]. > > Would be nice to support being able to override the encodings in some way. > That allows users to experiment with various encoding strategies manually as > well as enables them to override the hardcoded defaults if they don't suit > their use case. > Couple of options I can think of: > Specifying encoding by type (or column): > {code} > parquet.writer.encoding-override.<type> = "encoding1[,encoding2]" > As an example: > "parquet.writer.encoding-override.int32" = "plain" > {code} > Chooses Plain encoding and hence the PlainValuesWriter. > When a primary + fallback need to be specified, we can do the following: > {code} > "parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array" > {code} > Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY > encoding as the fallback and hence creates a > FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter). > In such cases we can mandate that the first encoding listed must allow for > Fallbacks by implementing > [RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31]. > > Another option suggested by [~alexlevenson], was to allow overriding of the > ValuesWriterFactory using reflection: > {code} > parquet.writer.factory-override = > "org.apache.parquet.hadoop.MyValuesWriterFactory" > {code} > This creates a factory, MyValuesWriterFactory which is then invoked for every > ColumnDescriptor to get a ValueWriter. This provides the flexibility to the > user to implement a ValuesWriterFactory that can read configuration for per > type / column encoding overrides. Can also be used to plug-in a more > sophisticated approach where we choose the appropriate encoding based on the > data being seen. A concern raised by [~rdblue] regarding this approach was > that ValuesWriters are supposed to be internal classes in Parquet. So we > shouldn't be allowing users to configure the ValuesWriter factories via > config. > cc [~julienledem] / [~rdblue] / [~alexlevenson] for you thoughts / other > ideas. We could also explore other ideas based on any other potential use > cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)