[
https://issues.apache.org/jira/browse/PARQUET-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832433#comment-16832433
]
ASF GitHub Bot commented on PARQUET-1561:
-----------------------------------------
zivanfi commented on pull request #129: PARQUET-1561: Remove erroneous brackets.
URL: https://github.com/apache/parquet-format/pull/129
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Inconsistencies in the Parquet Delta Encoding specification
> -----------------------------------------------------------
>
> Key: PARQUET-1561
> URL: https://issues.apache.org/jira/browse/PARQUET-1561
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Daniel Becker
> Priority: Major
> Labels: pull-request-available
>
> There are several imprecise/inconsistent formulations in the specification of
> the Parquet Delta Encoding
> ([https://github.com/apache/parquet-format/blob/master/Encodings.md]).
> # In the beginning of the Delta Encoding section, it is written that
> {quote}When there are not enough values to encode a full block we pad with
> zeros (added to the frame of reference).{quote}
> From the parquet-mr implementation of Delta Encoding
> ([https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-column/src/main/java/org/apache/parquet/column/values/delta/DeltaBinaryPackingValuesWriterForInteger.java]),
> it seems that when the number of elements does not fill a complete
> miniblock, we do use padding (otherwise the data would not always end on a
> byte boundary), but that short blocks are not padded, i.e. we do not add
> empty/unspecified miniblocks to the block and do not even set the bit width
> to zero for the remaining miniblocks (which is not very good in my opinion).
> The specification should be clearer on this point.
> # In the description of the header, it is written that
> {quote}the block size is a multiple of 128 stored as VLQ int{quote}
> According to Wikipedia, VLQ is big-endian and the corresponding little-endian
> encoding us ULEB128
> ([https://en.wikipedia.org/wiki/Variable-length_quantity],
> [https://en.wikipedia.org/wiki/LEB128]). The parquet-mr implementation uses
> the little-endian format. The number encoding is called VLQ in the whole
> Delta Encoding specification, not just here.
> As the implementaion is already in use, the best would be to update the
> specification to match the implementation.
> # The next line is:
> {quote}the miniblock count per block is a diviser of the block size stored as
> VLQ int the number of values in the miniblock is a multiple of 32.{quote}
> This should be stylistically improved. Also, divisor is spelled with an ‘o’.
> For example:
> {quote}the miniblock count per block is a divisor of the block size such that
> their quotient, the number of values in a miniblock, is a multiple of 32; it
> is stored as a ULEB128 int{quote}
> # In the section describing the block:
> {quote}the min delta is a VLQ int{quote}
> I think it should be more precise and say that the min delta is a zigzag
> VLQ/ULEB int as plain VLQ and ULEB are unsigned and the zigzag version is
> actually used in parquet-mr.
> # Later in the same section:
> {quote}Having multiple blocks allows us to escape values and restart from a
> new base value.{quote}
> The reader may think that in each block, we have a new base value according
> to which we compute the delta of the next element, but it is not true. The
> base value is the very first value in the page, which is stored in the
> header. What the author meant is that we have a new min delta in each block
> that is the frame of reference for the deltas in that block (we subtract it
> from the deltas to make them non-negative), but in my opinion it is not clear
> from this sentence.
> # In the section describing the algorithm to encode the values (beginning
> with “To encode each delta block...“), in step 2, it says this:
> {quote}Encode the first value as zigzag VLQ int{quote}
> This is misleading as we do not store the first value of each block as a
> VLQ/ULEB int, only the very first value in the page is stored in such a way,
> in the header, not in each block. Generally I think the description of the
> algorithm could be more straightforward, I find it a little difficult to
> understand.
> # In the examples, the block sizes are not multiples of 128, but the
> specification requires that. Either the examples should be replaced with
> valid ones or it should be noted that this is to keep the examples shorter.
> Also, it would be useful to include examples with multiple blocks.
> # In the ‘Characteristics’ section, miniblock is written in two words, while
> in the rest of the specification, it is written as one.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)