[ https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377382#comment-16377382 ]
Ryan Blue edited comment on PARQUET-796 at 2/26/18 7:06 PM:
------------------------------------------------------------
I don't recommend using the delta long encoding because I think we need to
update to better encodings (specifically, the zig-zag-encoding ones in [this
branch|https://github.com/rdblue/parquet-mr/commits/encoders]).
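As a rough illustration of why zig-zag encoding helps a delta encoder, here is a minimal Java sketch. The class and method names are hypothetical and are not taken from the linked branch or from parquet-mr. It only shows the general technique: signed deltas are mapped to unsigned values so that small negative and small positive deltas both fit in few bits.

```java
final class ZigZagDelta {
    // Zig-zag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    // so a delta of small magnitude becomes a small unsigned number.
    static long zigZagEncode(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Inverse of zigZagEncode.
    static long zigZagDecode(long z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Replaces each value by the zig-zag encoding of its difference from the
    // previous value (the first value is taken relative to 0).
    static long[] encodeDeltas(long[] values) {
        long[] out = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            out[i] = zigZagEncode(values[i] - prev);
            prev = values[i];
        }
        return out;
    }

    // Rebuilds the original values from the zig-zag encoded deltas.
    static long[] decodeDeltas(long[] encoded) {
        long[] out = new long[encoded.length];
        long prev = 0;
        for (int i = 0; i < encoded.length; i++) {
            prev += zigZagDecode(encoded[i]);
            out[i] = prev;
        }
        return out;
    }
}
```

A real encoder would then bit-pack the resulting values; that step is left out of this sketch.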
We could definitely use a better fallback, but I don't think the solution is to
turn off dictionary encoding. If you can use dictionary encoding to get a
smaller size, you should. The problem is when dictionary encoding needs to test
whether another encoding would be better. It currently tests against plain and
uses plain. We should have it test against a delta encoding and use one.
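To make the fallback decision concrete, here is a minimal, self-contained Java sketch. It is not parquet-mr code: the class, the enum, and the size estimates are all assumptions made for illustration (8-byte plain values, a dictionary of 8-byte entries plus bit-packed ids, and deltas bit-packed at the width of the widest zig-zag delta). It contrasts comparing the dictionary against plain with comparing it against a delta encoding.

```java
import java.util.HashSet;
import java.util.Set;

final class FallbackChoiceSketch {
    enum Encoding { DICTIONARY, PLAIN, DELTA }

    // Assumed cost model: every value stored as a full 8-byte long.
    static long plainSize(long[] v) {
        return 8L * v.length;
    }

    // Assumed cost model: 8 bytes per distinct value plus bit-packed ids.
    static long dictionarySize(long[] v) {
        Set<Long> distinct = new HashSet<>();
        for (long x : v) {
            distinct.add(x);
        }
        int maxId = Math.max(1, distinct.size() - 1);
        int bitsPerId = 64 - Long.numberOfLeadingZeros(maxId);
        return 8L * distinct.size() + ((long) bitsPerId * v.length + 7) / 8;
    }

    // Assumed cost model: first value in 8 bytes, then zig-zag deltas
    // bit-packed at the width of the widest delta.
    static long deltaSize(long[] v) {
        if (v.length == 0) {
            return 0;
        }
        int width = 0;
        for (int i = 1; i < v.length; i++) {
            long d = v[i] - v[i - 1];
            long zz = (d << 1) ^ (d >> 63);
            width = Math.max(width, 64 - Long.numberOfLeadingZeros(zz));
        }
        return 8 + ((long) width * (v.length - 1) + 7) / 8;
    }

    // Behaviour described above: test against plain, fall back to plain.
    static Encoding chooseAgainstPlain(long[] v) {
        return dictionarySize(v) < plainSize(v) ? Encoding.DICTIONARY : Encoding.PLAIN;
    }

    // Suggested behaviour: test against a delta encoding, fall back to it.
    static Encoding chooseAgainstDelta(long[] v) {
        return dictionarySize(v) < deltaSize(v) ? Encoding.DICTIONARY : Encoding.DELTA;
    }
}
```

Under these assumed costs, a column of strictly increasing ids makes the dictionary lose in both variants, but only the second variant ends up with a compact encoding. A low-cardinality column keeps the dictionary either way.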
This kind of improvement is why we added PARQUET-601. We want to be able to
test out different ways of choosing an encoding at write time. But we do not
want to make it so that users must specify their own encodings because we want
Parquet to select them automatically and get the choice right. PARQUET-601 is
about testing out strategies that we release as the defaults.
> Delta Encoding is not used when dictionary enabled
> --------------------------------------------------
>
> Key: PARQUET-796
> URL: https://issues.apache.org/jira/browse/PARQUET-796
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.9.0
> Reporter: Jakub Liska
> Priority: Critical
> Fix For: 1.9.1
>
>
> The current code does not allow Delta Encoding and Dictionary Encoding to be
> used together. If I instantiate ParquetWriter like this:
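> {code}
> // Scala cannot use named arguments with a Java-defined constructor,
> // so the two boolean flags are passed positionally and labelled in comments.
> val writer = new ParquetWriter[Group](outFile, new GroupWriteSupport, codec,
>   blockSize, pageSize, dictPageSize, /* enableDictionary = */ true, /* validating = */ true,
>   ParquetProperties.WriterVersion.PARQUET_2_0, configuration)
> {code}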
> then this piece of code:
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> causes DictionaryValuesWriter to be used instead of the inferred
> DeltaLongEncodingWriter.
> The original issue is here:
> https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768