[ 
https://issues.apache.org/jira/browse/PARQUET-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434083#comment-15434083
 ] 

Julien Le Dem commented on PARQUET-682:
---------------------------------------

In general I think we have 2 use cases:

1) The users have specific knowledge of the data that lets them pick a better 
encoding for a given column.
For this we want the override to be by column name rather than type, for 
example:
 - the user knows that a field will not get dictionary encoded but will perform 
well with prefix coding. It will save time/memory not to fall back from 
dictionary coding and just do prefix coding right away.
 - the user knows that a specific encoding will do better on a given column and 
wants to try it first.
 - the user wants to force dictionary encoding on a certain field (and fail if 
it gets too big) for perf reasons.
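
To make the per-column case concrete, here is a minimal sketch of how such overrides could be read from configuration. Everything in it is an assumption for illustration: the class name ColumnEncodingOverrides, the key prefix "parquet.writer.encoding-override.column." (modeled on the per-type key proposed in the issue) and the use of a plain Map in place of a real configuration object are hypothetical, and none of this is existing parquet-mr API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of parquet-mr. */
final class ColumnEncodingOverrides {
    // Assumed key prefix; the issue only proposes a per-type key.
    static final String PREFIX = "parquet.writer.encoding-override.column.";

    /**
     * Maps a column path to [primary] or [primary, fallback] encoding names.
     * A single entry means "use this encoding and do not fall back".
     */
    static Map<String, List<String>> parse(Map<String, String> conf) {
        Map<String, List<String>> overrides = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            if (!entry.getKey().startsWith(PREFIX)) {
                continue; // unrelated property
            }
            String column = entry.getKey().substring(PREFIX.length());
            List<String> encodings = Arrays.stream(entry.getValue().split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(s -> s.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
            if (encodings.isEmpty() || encodings.size() > 2) {
                throw new IllegalArgumentException(
                    "Expected \"primary[,fallback]\" for column " + column);
            }
            overrides.put(column, encodings);
        }
        return overrides;
    }
}
```

With this shape, "delta_byte_array" alone would cover the first bullet (go straight to prefix coding), and "rle_dictionary" alone would cover the last one, provided the writer treats a missing fallback as "fail when the dictionary gets too big".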

2) Tweaking a general heuristic to pick a good encoding unsupervised.
Your suggestion (override by type) seems to apply to this one in particular.
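
If both kinds of override were supported, one possible resolution order is: column override first, then type override, then the built-in heuristic. The sketch below only illustrates that precedence. The class EncodingResolver and its string-keyed maps are hypothetical and do not exist in parquet-mr.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical resolver, not part of parquet-mr. */
final class EncodingResolver {
    private final Map<String, String> byColumn; // use case 1: per-column knowledge
    private final Map<String, String> byType;   // use case 2: tweak the general heuristic

    EncodingResolver(Map<String, String> byColumn, Map<String, String> byType) {
        this.byColumn = byColumn;
        this.byType = byType;
    }

    /**
     * Returns the override to apply, or empty when the default
     * heuristic should pick the encoding.
     */
    Optional<String> resolve(String columnPath, String primitiveType) {
        String override = byColumn.get(columnPath);
        if (override == null) {
            override = byType.get(primitiveType);
        }
        return Optional.ofNullable(override);
    }
}
```

Letting the column entry win keeps the specific knowledge of use case 1 from being masked by a broader per-type setting.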


> Configure the encoding used by ValueWriters
> -------------------------------------------
>
>                 Key: PARQUET-682
>                 URL: https://issues.apache.org/jira/browse/PARQUET-682
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Piyush Narang
>
> This was supposed to be tackled by jira: 
> https://issues.apache.org/jira/browse/PARQUET-601 but that ended up being 
> just the work done to refactor the ValuesWriter factory code out of 
> ParquetProperties. As that is now merged, it would be nice to revisit the 
> original purpose - being able to configure which type of ValuesWriters to be 
> used for writing out columns. 
> Background: Parquet is currently structured to choose the appropriate value 
> writer based on the type of the column as well as the Parquet version. Value 
> writers are responsible for writing out values with the appropriate encoding. 
> As an example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) 
> or RunLengthBitPackingHybridValuesWriter (v2.0). Code to do this is in the 
> [DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L31]
>  and the 
> [DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L35].
>  
> Would be nice to support being able to override the encodings in some way. 
> That allows users to experiment with various encoding strategies manually as 
> well as enables them to override the hardcoded defaults if they don't suit 
> their use case.
> Couple of options I can think of:
> Specifying encoding by type (or column):
> {code}
> parquet.writer.encoding-override.<type> = "encoding1[,encoding2]"
> {code}
> As an example:
> {code}
> "parquet.writer.encoding-override.int32" = "plain"
> {code}
> Chooses Plain encoding and hence the PlainValuesWriter.
> When a primary + fallback need to be specified, we can do the following:
> {code}
> "parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
> {code}
> Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY 
> encoding as the fallback and hence creates a 
> FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter). 
> In such cases we can mandate that the first encoding listed must allow for 
> Fallbacks by implementing 
> [RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
>  
> Another option suggested by [~alexlevenson], was to allow overriding of the 
> ValuesWriterFactory using reflection:
> {code}
> parquet.writer.factory-override = 
> "org.apache.parquet.hadoop.MyValuesWriterFactory"
> {code}
> This creates a factory, MyValuesWriterFactory which is then invoked for every 
> ColumnDescriptor to get a ValuesWriter. This provides the flexibility to the 
> user to implement a ValuesWriterFactory that can read configuration for per 
> type / column encoding overrides. Can also be used to plug-in a more 
> sophisticated approach where we choose the appropriate encoding based on the 
> data being seen. A concern raised by [~rdblue] regarding this approach was 
> that ValuesWriters are supposed to be internal classes in Parquet. So we 
> shouldn't be allowing users to configure the ValuesWriter factories via 
> config.
> cc [~julienledem] / [~rdblue] / [~alexlevenson] for your thoughts / other 
> ideas. We could also explore other ideas based on any other potential use 
> cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
