Daniel Becker created PARQUET-1561:
--------------------------------------

             Summary: Inconsistencies in the Parquet Delta Encoding 
specification
                 Key: PARQUET-1561
                 URL: https://issues.apache.org/jira/browse/PARQUET-1561
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format
            Reporter: Daniel Becker


There are several imprecise/inconsistent formulations in the specification of 
the Parquet Delta Encoding 
([https://github.com/apache/parquet-format/blob/master/Encodings.md]).
# In the beginning of the Delta Encoding section, it is written that
{quote}When there are not enough values to encode a full block we pad with 
zeros (added to the frame of reference).{quote}
From the parquet-mr implementation of Delta Encoding 
([https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-column/src/main/java/org/apache/parquet/column/values/delta/DeltaBinaryPackingValuesWriterForInteger.java]),
it seems that when the number of elements does not fill a complete miniblock, 
we do use padding (otherwise the data would not always end on a byte 
boundary), but that short blocks are not padded, i.e. we do not add 
empty/unspecified miniblocks to the block and do not even set the bit width to 
zero for the remaining miniblocks (which is not very good in my opinion). The 
specification should be clearer on this point.
# In the description of the header, it is written that
{quote}the block size is a multiple of 128 stored as VLQ int{quote}
According to Wikipedia, VLQ is big-endian and the corresponding little-endian 
encoding is ULEB128 ([https://en.wikipedia.org/wiki/Variable-length_quantity], 
[https://en.wikipedia.org/wiki/LEB128]). The parquet-mr implementation uses the 
little-endian format. The number encoding is called VLQ throughout the Delta 
Encoding specification, not just here.
As the implementation is already in use, the best option would be to update the 
specification to match the implementation.
# The next line is:
{quote}the miniblock count per block is a diviser of the block size stored as 
VLQ int the number of values in the miniblock is a multiple of 32.{quote}
This should be stylistically improved. Also, divisor is spelled with an ‘o’. 
For example:
{quote}the miniblock count per block is a divisor of the block size such that 
their quotient, the number of values in a miniblock, is a multiple of 32; it is 
stored as a ULEB128 int{quote}
# In the section describing the block:
{quote}the min delta is a VLQ int{quote}
I think it should be more precise and say that the min delta is a zigzag 
VLQ/ULEB int, as plain VLQ and ULEB are unsigned and the zigzag version is 
what parquet-mr actually uses.
# Later in the same section:
{quote}Having multiple blocks allows us to escape values and restart from a new 
base value.{quote}
The reader may think that each block has a new base value from which the 
delta of the next element is computed, but that is not the case. The base 
value is the very first value in the page, which is stored in the header. What 
the author meant is that each block has a new min delta that serves as the 
frame of reference for the deltas in that block (we subtract it from the deltas 
to make them non-negative), but in my opinion this is not clear from the 
sentence.
# In the section describing the algorithm to encode the values (beginning with 
“To encode each delta block...”), in step 2, it says this:
{quote}Encode the first value as zigzag VLQ int{quote}
This is misleading, as we do not store the first value of each block as a 
VLQ/ULEB int. Only the very first value in the page is stored in such a way, 
and it is stored in the header, not in each block. Generally, I think the 
description of the algorithm could be more straightforward; I find it a little 
difficult to understand.
# In the examples, the block sizes are not multiples of 128, although the 
specification requires them to be. Either the examples should be replaced with 
valid ones or it should be noted that this is to keep the examples shorter. 
Also, it would be useful to include examples with multiple blocks.
# In the ‘Characteristics’ section, miniblock is written as two words, while in 
the rest of the specification, it is written as one.
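
To make the VLQ/ULEB128 and zigzag points above concrete, here is a minimal 
Python sketch. It is not taken from parquet-mr; the function names are 
hypothetical and the 64-bit width used for zigzag is an assumption. It only 
shows how the two byte orders differ for the same number and how zigzag maps 
signed values to unsigned ones.

```python
def uleb128_encode(n: int) -> bytes:
    """Unsigned LEB128: 7-bit groups, least-significant group first."""
    if n < 0:
        raise ValueError("ULEB128 encodes unsigned integers only")
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)  # more groups follow: set continuation bit
        else:
            out.append(group)         # last group: continuation bit clear
            return bytes(out)


def vlq_encode(n: int) -> bytes:
    """Big-endian VLQ: 7-bit groups, most-significant group first."""
    if n < 0:
        raise ValueError("VLQ encodes unsigned integers only")
    groups = [n & 0x7F]               # last byte on the wire, no continuation bit
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(groups))


def zigzag_encode(n: int, bits: int = 64) -> int:
    """Map a signed integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


# Same value, different byte order on the wire.
print(uleb128_encode(300).hex())  # little-endian groups
print(vlq_encode(300).hex())      # big-endian groups

# A negative min delta has to be zigzag-encoded before ULEB128 can store it.
print(uleb128_encode(zigzag_encode(-2)).hex())
```

For single-byte values (below 128) the two encodings coincide, which may be 
why the naming difference is easy to overlook.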

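The points about the base value and the per-block min delta can be sketched 
the same way. This is an illustration of the behaviour as I read it from 
parquet-mr, not the actual implementation: the names are hypothetical, bit 
packing and miniblocks are left out, and the block size of 4 is used only to 
keep the example short (the specification requires a multiple of 128).

```python
def delta_blocks(values, block_size=128):
    """Split a page into (first_value, [(min_delta, adjusted_deltas), ...])."""
    if not values:
        raise ValueError("need at least one value")
    first_value = values[0]  # stored once, in the page header
    # Deltas run across the whole page; they do not restart at block borders.
    deltas = [b - a for a, b in zip(values, values[1:])]
    blocks = []
    for start in range(0, len(deltas), block_size):
        chunk = deltas[start:start + block_size]
        min_delta = min(chunk)  # per-block frame of reference
        # Subtracting min_delta makes every stored delta non-negative.
        blocks.append((min_delta, [d - min_delta for d in chunk]))
    return first_value, blocks


def delta_decode(first_value, blocks):
    """Inverse of delta_blocks: rebuild the original values."""
    values = [first_value]
    for min_delta, adjusted in blocks:
        for a in adjusted:
            values.append(values[-1] + a + min_delta)
    return values


page = [7, 5, 3, 1, 2, 3, 4, 5]
first, blocks = delta_blocks(page, block_size=4)  # 4 is for illustration only
print(first)   # the only value stored verbatim
print(blocks)  # one (min_delta, adjusted_deltas) pair per block
assert delta_decode(first, blocks) == page
```

Only the min delta changes from block to block; the running value carries 
over, so a block cannot be decoded without the blocks before it.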


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
