[jira] [Comment Edited] (PARQUET-796) Delta Encoding is not used when dictionary enabled

Ryan Blue (JIRA) Mon, 26 Feb 2018 11:07:22 -0800

    [ https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377382#comment-16377382 ]


Ryan Blue edited comment on PARQUET-796 at 2/26/18 7:06 PM:
------------------------------------------------------------

I don't recommend using the delta long encoding because I think we need to 
update to better encodings (specifically, the zig-zag-encoding ones in [this 
branch|https://github.com/rdblue/parquet-mr/commits/encoders]).
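
For readers unfamiliar with the term, here is a minimal sketch of the standard 
zig-zag mapping applied to deltas of a long column. It is an illustration only 
and is not taken from the branch linked above; the class and method names 
(ZigZagSketch, encode, decode, encodeDeltas) are hypothetical.

```java
// Illustration only: the standard zig-zag mapping for 64-bit values.
// Small magnitudes, positive or negative, map to small unsigned numbers,
// which then bit-pack or varint-encode compactly.
final class ZigZagSketch {
    private ZigZagSketch() {}

    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    static long encode(long n) {
        // The arithmetic shift spreads the sign bit across all 64 bits.
        return (n << 1) ^ (n >> 63);
    }

    static long decode(long z) {
        // Logical shift drops the sign flag; the XOR restores the sign.
        return (z >>> 1) ^ -(z & 1);
    }

    // Differences between consecutive values (the first is taken against 0),
    // each zig-zag encoded.
    static long[] encodeDeltas(long[] values) {
        long[] out = new long[values.length];
        long previous = 0L;
        for (int i = 0; i < values.length; i++) {
            out[i] = encode(values[i] - previous);
            previous = values[i];
        }
        return out;
    }
}
```

With this sketch, encodeDeltas on the values 100, 98, 101 yields 200, 3, 6: the 
two small deltas (-2 and +3) stay small regardless of sign.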

We could definitely use a better fallback, but I don't think the solution is to 
turn off dictionary encoding. If dictionary encoding gives you a smaller size, 
you should use it. The problem is the fallback check: when the dictionary 
writer tests whether another encoding would be smaller, it currently compares 
against plain and falls back to plain. We should have it compare against a 
delta encoding and fall back to that instead.
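
To make the idea concrete, here is a rough sketch of a size-based choice that 
considers a delta fallback as well as plain. It is not parquet-mr code: the 
class FallbackChoiceSketch, its Encoding enum, and the size estimates are all 
hypothetical, and the estimates are much cruder than what a real delta or 
dictionary writer produces.

```java
// Hypothetical sketch, not parquet-mr code: keep dictionary encoding when it
// is smallest, otherwise fall back to the smaller of plain and delta instead
// of always falling back to plain.
final class FallbackChoiceSketch {
    enum Encoding { DICTIONARY, PLAIN, DELTA }

    private FallbackChoiceSketch() {}

    // Plain stores every long as 8 bytes.
    static long plainSize(long[] values) {
        return 8L * values.length;
    }

    // Crude delta estimate: the first value in full, then every delta
    // bit-packed at the width of the largest zig-zag-encoded delta.
    static long deltaSize(long[] values) {
        if (values.length == 0) {
            return 0L;
        }
        long maxZigZag = 0L;
        for (int i = 1; i < values.length; i++) {
            long delta = values[i] - values[i - 1];
            long zigZag = (delta << 1) ^ (delta >> 63);
            // Zig-zag output is unsigned, so compare it as unsigned.
            if (Long.compareUnsigned(zigZag, maxZigZag) > 0) {
                maxZigZag = zigZag;
            }
        }
        int bitWidth = 64 - Long.numberOfLeadingZeros(maxZigZag);
        long packedBits = (long) bitWidth * (values.length - 1);
        return 8L + (packedBits + 7) / 8;
    }

    // Crude dictionary estimate: 8 bytes per distinct value plus one
    // bit-packed index per value.
    static long dictionarySize(long[] values) {
        long distinct = java.util.Arrays.stream(values).distinct().count();
        int indexWidth = 64 - Long.numberOfLeadingZeros(Math.max(distinct - 1, 0L));
        return 8L * distinct + ((long) indexWidth * values.length + 7) / 8;
    }

    static Encoding choose(long[] values) {
        long plain = plainSize(values);
        long delta = deltaSize(values);
        // Fallback candidate: delta only when strictly smaller than plain.
        Encoding fallback = delta < plain ? Encoding.DELTA : Encoding.PLAIN;
        long fallbackSize = Math.min(plain, delta);
        return dictionarySize(values) < fallbackSize ? Encoding.DICTIONARY : fallback;
    }
}
```

Under these assumptions, a column of evenly spaced timestamps (all distinct, 
small deltas) comes out as DELTA, while a column that repeats a handful of 
values still comes out as DICTIONARY.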

This kind of improvement is why we added PARQUET-601. We want to be able to 
test out different ways of choosing an encoding at write time. But we do not 
want to require users to specify their own encodings: Parquet should select 
them automatically and get the choice right. PARQUET-601 is about testing out 
strategies that we can then release as the defaults.


> Delta Encoding is not used when dictionary enabled
> --------------------------------------------------
>
>                 Key: PARQUET-796
>                 URL: https://issues.apache.org/jira/browse/PARQUET-796
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.9.0
>            Reporter: Jakub Liska
>            Priority: Critical
>             Fix For: 1.9.1
>
>
> The current code does not allow Delta Encoding and Dictionary Encoding to be 
> used together. If I instantiate ParquetWriter like this:
> {code}
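> // ParquetWriter is a Java class, so the two booleans are passed positionally
> val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec, 
> blockSize, pageSize, dictPageSize, /* enableDictionary = */ true, 
> /* validating = */ true, ParquetProperties.WriterVersion.PARQUET_2_0, configuration)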
> {code}
> then this piece of code: 
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> causes DictionaryValuesWriter to be used instead of the inferred 
> DeltaLongEncodingWriter. 
> The original issue is here: 
> https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
