[ 
https://issues.apache.org/jira/browse/PARQUET-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1561:
--------------------------------------
    Fix Version/s: format-2.7.0

> Inconsistencies in the Parquet Delta Encoding specification
> -----------------------------------------------------------
>
>                 Key: PARQUET-1561
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1561
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Daniel Becker
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: format-2.7.0
>
>
> There are several imprecise/inconsistent formulations in the specification of 
> the Parquet Delta Encoding 
> ([https://github.com/apache/parquet-format/blob/master/Encodings.md]).
> # In the beginning of the Delta Encoding section, it is written that
> {quote}When there are not enough values to encode a full block we pad with 
> zeros (added to the frame of reference).{quote}
> From the parquet-mr implementation of Delta Encoding 
> ([https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-column/src/main/java/org/apache/parquet/column/values/delta/DeltaBinaryPackingValuesWriterForInteger.java]),
>  it seems that when the number of elements does not fill a complete 
> miniblock, we do use padding (otherwise the data would not always end on a 
> byte boundary), but that short blocks are not padded, i.e. we do not add 
> empty/unspecified miniblocks to the block and do not even set the bit width 
> to zero for the remaining miniblocks (which is not very good in my opinion). 
> The specification should be clearer on this point.
> # In the description of the header, it is written that
> {quote}the block size is a multiple of 128 stored as VLQ int{quote}
> According to Wikipedia, VLQ is big-endian and the corresponding little-endian 
> encoding us ULEB128 
> ([https://en.wikipedia.org/wiki/Variable-length_quantity], 
> [https://en.wikipedia.org/wiki/LEB128]). The parquet-mr implementation uses 
> the little-endian format. The number encoding is called VLQ in the whole 
> Delta Encoding specification, not just here.
> As the implementaion is already in use, the best would be to update the 
> specification to match the implementation.
> # The next line is:
> {quote}the miniblock count per block is a diviser of the block size stored as 
> VLQ int the number of values in the miniblock is a multiple of 32.{quote}
> This should be stylistically improved. Also, divisor is spelled with an ‘o’. 
> For example:
> {quote}the miniblock count per block is a divisor of the block size such that 
> their quotient, the number of values in a miniblock, is a multiple of 32; it 
> is stored as a ULEB128 int{quote}
> # In the section describing the block:
> {quote}the min delta is a VLQ int{quote}
> I think it should be more precise and say that the min delta is a zigzag 
> VLQ/ULEB int as plain VLQ and ULEB are unsigned and the zigzag version is 
> actually used in parquet-mr.
> # Later in the same section:
> {quote}Having multiple blocks allows us to escape values and restart from a 
> new base value.{quote}
> The reader may think that in each block, we have a new base value according 
> to which we compute the delta of the next element, but it is not true. The 
> base value is the very first value in the page, which is stored in the 
> header. What the author meant is that we have a new min delta in each block 
> that is the frame of reference for the deltas in that block (we subtract it 
> from the deltas to make them non-negative), but in my opinion it is not clear 
> from this sentence.
> # In the section describing the algorithm to encode the values (beginning 
> with “To encode each delta block...“), in step 2, it says this:
> {quote}Encode the first value as zigzag VLQ int{quote}
> This is misleading as we do not store the first value of each block as a 
> VLQ/ULEB int, only the very first value in the page is stored in such a way, 
> in the header, not in each block. Generally I think the description of the 
> algorithm could be more straightforward, I find it a little difficult to 
> understand.
> # In the examples, the block sizes are not multiples of 128, but the 
> specification requires that. Either the examples should be replaced with 
> valid ones or it should be noted that this is to keep the examples shorter. 
> Also, it would be useful to include examples with multiple blocks.
> # In the ‘Characteristics’ section, miniblock is written in two words, while 
> in the rest of the specification, it is written as one.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to