[jira] [Commented] (PARQUET-682) Configure the encoding used by ValueWriters

2016-08-24 Thread Piyush Narang (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435916#comment-15435916 ]

Piyush Narang commented on PARQUET-682:
---

Thanks [~julienledem], that makes sense. I'll file a couple of follow-up JIRAs 
in that case to tackle these two scenarios. For 1, we could do something along 
the lines of what I suggested above:
{code}
parquet.writer.encoding-override.my-column1="plain"
parquet.writer.encoding-override.my-column2="delta_byte_array"
{code}
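As a hedged sketch of how such per-column keys could be collected (the key prefix comes from the proposal above, but the parsing helper and its plain-`Map` input are illustrative stand-ins, not an actual parquet-mr API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: gather per-column encoding overrides from
// configuration entries of the form
//   parquet.writer.encoding-override.<column> = <encoding>
// A plain Map stands in for a Hadoop Configuration here.
public class EncodingOverrides {
  static final String PREFIX = "parquet.writer.encoding-override.";

  static Map<String, String> parse(Map<String, String> conf) {
    Map<String, String> overrides = new HashMap<>();
    for (Map.Entry<String, String> e : conf.entrySet()) {
      if (e.getKey().startsWith(PREFIX)) {
        // Everything after the prefix names the column being overridden.
        String column = e.getKey().substring(PREFIX.length());
        overrides.put(column, e.getValue().trim().toLowerCase());
      }
    }
    return overrides;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("parquet.writer.encoding-override.my-column1", "plain");
    conf.put("parquet.writer.encoding-override.my-column2", "delta_byte_array");
    conf.put("parquet.compression", "snappy"); // unrelated key, ignored
    Map<String, String> overrides = parse(conf);
    System.out.println(overrides.get("my-column1"));  // plain
    System.out.println(overrides.get("my-column2"));  // delta_byte_array
  }
}
```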
That seems like the more straightforward scenario to start off with. 

I guess some of the confusion was regarding the approach for 2. It would be nice 
if the heuristic-based algorithm were configurable. For something like that, to 
ensure we don't use reflection explicitly, we would have to have a list of 
well-known overrides and map them to actual ValuesWriter factories:
{code}
parquet.writer.encoding-override.int32="auto-heuristic-A"
{code}

And somewhere in the code we map auto-heuristic-A to, say, 
AutoHeuristicValuesWriterFactory. Whenever we add a new heuristic, we update 
the map / enum and allow users to try it out. As an initial example, we could 
provide a heuristic / strategy that allows users to override the encoding per 
type. That would give us a simple example in the repo which can be used as a 
template for more sophisticated auto-heuristic strategies in the future. (Open 
to a simpler example / heuristic if you'd prefer.) 
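A minimal sketch of that name-to-factory map, assuming a simplified stand-in for the ValuesWriterFactory interface (the registry shape and class names here are illustrative, not actual parquet-mr code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative registry: well-known override names map to factory suppliers,
// so no reflection is needed to resolve a configured heuristic.
public class FactoryRegistry {
  // Stand-in for parquet-mr's ValuesWriterFactory interface.
  interface ValuesWriterFactory {
    String describe();
  }

  static class AutoHeuristicValuesWriterFactory implements ValuesWriterFactory {
    public String describe() { return "auto-heuristic-A"; }
  }

  static final Map<String, Supplier<ValuesWriterFactory>> REGISTRY = new HashMap<>();
  static {
    // Adding a new heuristic means adding one entry here; users can then
    // select it by name in their configuration.
    REGISTRY.put("auto-heuristic-A", AutoHeuristicValuesWriterFactory::new);
  }

  static ValuesWriterFactory lookup(String name) {
    Supplier<ValuesWriterFactory> s = REGISTRY.get(name);
    if (s == null) {
      throw new IllegalArgumentException("Unknown encoding override: " + name);
    }
    return s.get();
  }

  public static void main(String[] args) {
    System.out.println(lookup("auto-heuristic-A").describe());
  }
}
```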

> Configure the encoding used by ValueWriters
> ---
>
> Key: PARQUET-682
> URL: https://issues.apache.org/jira/browse/PARQUET-682
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Piyush Narang
>
> This was supposed to be tackled by jira: 
> https://issues.apache.org/jira/browse/PARQUET-601 but that ended up being 
> just the work done to refactor the ValuesWriter factory code out of 
> ParquetProperties. As that is now merged, it would be nice to revisit the 
> original purpose - being able to configure which type of ValuesWriters to be 
> used for writing out columns. 
> Background: Parquet is currently structured to choose the appropriate value 
> writer based on the type of the column as well as the Parquet version. Value 
> writers are responsible for writing out values with the appropriate encoding. 
> As an example, for Boolean data types, we use BooleanPlainValuesWriter (v1.0) 
> or RunLengthBitPackingHybridValuesWriter (v2.0). Code to do this is in the 
> [DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java#L31]
>  and the 
> [DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java#L35].
>  
> It would be nice to support overriding the encodings in some way. That would 
> allow users to experiment with various encoding strategies manually, as well 
> as enable them to override the hardcoded defaults if those don't suit their 
> use case.
> Couple of options I can think of:
> Specifying encoding by type (or column):
> {code}
> parquet.writer.encoding-override.<type-or-column> = "encoding1[,encoding2]"
> As an example:
> "parquet.writer.encoding-override.int32" = "plain"
> {code}
> Chooses Plain encoding and hence the PlainValuesWriter.
> When a primary + fallback need to be specified, we can do the following:
> {code}
> "parquet.writer.encoding-override.binary" = "rle_dictionary,delta_byte_array"
> {code}
> Chooses RLE_DICTIONARY encoding as the initial encoding and DELTA_BYTE_ARRAY 
> encoding as the fallback and hence creates a 
> FallbackWriter(PlainBinaryDictionaryValuesWriter, DeltaByteArrayWriter). 
> In such cases we can mandate that the first encoding listed must allow for 
> Fallbacks by implementing 
> [RequiresFallback|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/RequiresFallback.java#L31].
>  
> Another option suggested by [~alexlevenson], was to allow overriding of the 
> ValuesWriterFactory using reflection:
> {code}
> parquet.writer.factory-override = 
> "org.apache.parquet.hadoop.MyValuesWriterFactory"
> {code}
> This creates a factory, MyValuesWriterFactory which is then invoked for every 
> ColumnDescriptor to get a ValueWriter. This provides the flexibility to the 
> user to implement a ValuesWriterFactory that can read configuration for per 
> type / column encoding overrides. Can also be used to plug-in a more 
> sophisticated approach where we choose the appropriate encoding based on the 
> data being seen. A concern raised by [~rdblue] regarding this approach was 
> that ValuesWriters are supposed to be internal classes in Parquet. So we 
> shouldn't be allowing users to configure the ValuesWriter factories via 
> config.
> cc [~julienledem] / [~rdblue] / [~alexlevenson] for your thoughts / other 
> ideas. We could also explore other ideas based on any other potential use 
> cases.
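The reflection-based factory override quoted above could be wired up roughly as follows (a hedged sketch: the `parquet.writer.factory-override` property name comes from the discussion, but this standalone interface and DemoFactory are illustrative, not parquet-mr classes):

```java
// Illustrative sketch of loading a user-supplied factory by class name,
// as a reflective override would do after reading
// parquet.writer.factory-override from the configuration.
public class ReflectiveFactoryLoader {
  // Stand-in for parquet-mr's ValuesWriterFactory interface.
  public interface ValuesWriterFactory {
    String name();
  }

  // Hypothetical user-provided implementation, used here as a demo target.
  public static class DemoFactory implements ValuesWriterFactory {
    public String name() { return "demo"; }
  }

  static ValuesWriterFactory load(String className) throws Exception {
    // Instantiate the configured class via its no-arg constructor and
    // require that it implements the factory interface.
    Class<?> clazz = Class.forName(className);
    return (ValuesWriterFactory) clazz.getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    ValuesWriterFactory f = load("ReflectiveFactoryLoader$DemoFactory");
    System.out.println(f.name());  // demo
  }
}
```

One design note: this flexibility is exactly what raises the concern quoted above, since any class name in the config becomes part of the de facto API surface.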

[jira] [Commented] (PARQUET-682) Configure the encoding used by ValueWriters

2016-08-23 Thread Julien Le Dem (JIRA)

[ https://issues.apache.org/jira/browse/PARQUET-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434083#comment-15434083 ]

Julien Le Dem commented on PARQUET-682:
---

In general I think we have 2 use cases:

1) The users have specific knowledge of the data that makes them pick a better 
encoding for a given column.
For this we want the override to be by column name rather than type, 
because, for example:
 - the user knows that a field will not get dictionary-encoded but will perform 
well with prefix coding; it saves time/memory to skip the fallback from 
dictionary coding and just do prefix coding right away.
 - the user knows that a specific encoding will do better on a given column and 
wants to try it first.
 - the user wants to force dictionary encoding on a certain field (and fail if 
it gets too big) for perf reasons.

2) Tweaking a general heuristic to pick a good encoding unsupervised.
Your suggestion (override by type) seems to apply to this case in particular.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)