Daniel Becker created PARQUET-1561:
--------------------------------------
Summary: Inconsistencies in the Parquet Delta Encoding
specification
Key: PARQUET-1561
URL: https://issues.apache.org/jira/browse/PARQUET-1561
Project: Parquet
Issue Type: Bug
Components: parquet-format
Reporter: Daniel Becker
There are several imprecise/inconsistent formulations in the specification of
the Parquet Delta Encoding
([https://github.com/apache/parquet-format/blob/master/Encodings.md]).
# In the beginning of the Delta Encoding section, it is written that
{quote}When there are not enough values to encode a full block we pad with
zeros (added to the frame of reference).{quote}
>From the parquet-mr implementation of Delta Encoding
>([https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-column/src/main/java/org/apache/parquet/column/values/delta/DeltaBinaryPackingValuesWriterForInteger.java]),
> it seems that when the number of elements does not fill a complete miniblock,
>we do use padding (otherwise the data would not always end on a byte
>boundary), but that short blocks are not padded, i.e. we do not add
>empty/unspecified miniblocks to the block and do not even set the bit width to
>zero for the remaining miniblocks (which is not very good in my opinion). The
>specification should be clearer on this point.
# In the description of the header, it is written that
{quote}the block size is a multiple of 128 stored as VLQ int{quote}
According to Wikipedia, VLQ is big-endian and the corresponding little-endian
encoding us ULEB128 ([https://en.wikipedia.org/wiki/Variable-length_quantity],
[https://en.wikipedia.org/wiki/LEB128]). The parquet-mr implementation uses the
little-endian format. The number encoding is called VLQ in the whole Delta
Encoding specification, not just here.
As the implementaion is already in use, the best would be to update the
specification to match the implementation.
# The next line is:
{quote}the miniblock count per block is a diviser of the block size stored as
VLQ int the number of values in the miniblock is a multiple of 32.{quote}
This should be stylistically improved. Also, divisor is spelled with an ‘o’.
For example:
{quote}the miniblock count per block is a divisor of the block size such that
their quotient, the number of values in a miniblock, is a multiple of 32; it is
stored as a ULEB128 int{quote}
# In the section describing the block:
{quote}the min delta is a VLQ int{quote}
I think it should be more precise and say that the min delta is a zigzag
VLQ/ULEB int as plain VLQ and ULEB are unsigned and the zigzag version is
actually used in parquet-mr.
# Later in the same section:
{quote}Having multiple blocks allows us to escape values and restart from a new
base value.{quote}
The reader may think that in each block, we have a new base value according to
which we compute the delta of the next element, but it is not true. The base
value is the very first value in the page, which is stored in the header. What
the author meant is that we have a new min delta in each block that is the
frame of reference for the deltas in that block (we subtract it from the deltas
to make them non-negative), but in my opinion it is not clear from this
sentence.
# In the section describing the algorithm to encode the values (beginning with
“To encode each delta block...“), in step 2, it says this:
{quote}Encode the first value as zigzag VLQ int{quote}
This is misleading as we do not store the first value of each block as a
VLQ/ULEB int, only the very first value in the page is stored in such a way, in
the header, not in each block. Generally I think the description of the
algorithm could be more straightforward, I find it a little difficult to
understand.
# In the examples, the block sizes are not multiples of 128, but the
specification requires that. Either the examples should be replaced with valid
ones or it should be noted that this is to keep the examples shorter. Also, it
would be useful to include examples with multiple blocks.
# In the ‘Characteristics’ section, miniblock is written in two words, while in
the rest of the specification, it is written as one.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)