[ https://issues.apache.org/jira/browse/PARQUET-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Szadovszky updated PARQUET-1561: -------------------------------------- Fix Version/s: format-2.7.0 > Inconsistencies in the Parquet Delta Encoding specification > ----------------------------------------------------------- > > Key: PARQUET-1561 > URL: https://issues.apache.org/jira/browse/PARQUET-1561 > Project: Parquet > Issue Type: Bug > Components: parquet-format > Reporter: Daniel Becker > Priority: Major > Labels: pull-request-available > Fix For: format-2.7.0 > > > There are several imprecise/inconsistent formulations in the specification of > the Parquet Delta Encoding > ([https://github.com/apache/parquet-format/blob/master/Encodings.md]). > # In the beginning of the Delta Encoding section, it is written that > {quote}When there are not enough values to encode a full block we pad with > zeros (added to the frame of reference).{quote} > From the parquet-mr implementation of Delta Encoding > ([https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-column/src/main/java/org/apache/parquet/column/values/delta/DeltaBinaryPackingValuesWriterForInteger.java]), > it seems that when the number of elements does not fill a complete > miniblock, we do use padding (otherwise the data would not always end on a > byte boundary), but that short blocks are not padded, i.e. we do not add > empty/unspecified miniblocks to the block and do not even set the bit width > to zero for the remaining miniblocks (which is not very good in my opinion). > The specification should be clearer on this point. > # In the description of the header, it is written that > {quote}the block size is a multiple of 128 stored as VLQ int{quote} > According to Wikipedia, VLQ is big-endian and the corresponding little-endian > encoding us ULEB128 > ([https://en.wikipedia.org/wiki/Variable-length_quantity], > [https://en.wikipedia.org/wiki/LEB128]). The parquet-mr implementation uses > the little-endian format. The number encoding is called VLQ in the whole > Delta Encoding specification, not just here. > As the implementaion is already in use, the best would be to update the > specification to match the implementation. > # The next line is: > {quote}the miniblock count per block is a diviser of the block size stored as > VLQ int the number of values in the miniblock is a multiple of 32.{quote} > This should be stylistically improved. Also, divisor is spelled with an ‘o’. > For example: > {quote}the miniblock count per block is a divisor of the block size such that > their quotient, the number of values in a miniblock, is a multiple of 32; it > is stored as a ULEB128 int{quote} > # In the section describing the block: > {quote}the min delta is a VLQ int{quote} > I think it should be more precise and say that the min delta is a zigzag > VLQ/ULEB int as plain VLQ and ULEB are unsigned and the zigzag version is > actually used in parquet-mr. > # Later in the same section: > {quote}Having multiple blocks allows us to escape values and restart from a > new base value.{quote} > The reader may think that in each block, we have a new base value according > to which we compute the delta of the next element, but it is not true. The > base value is the very first value in the page, which is stored in the > header. What the author meant is that we have a new min delta in each block > that is the frame of reference for the deltas in that block (we subtract it > from the deltas to make them non-negative), but in my opinion it is not clear > from this sentence. > # In the section describing the algorithm to encode the values (beginning > with “To encode each delta block...“), in step 2, it says this: > {quote}Encode the first value as zigzag VLQ int{quote} > This is misleading as we do not store the first value of each block as a > VLQ/ULEB int, only the very first value in the page is stored in such a way, > in the header, not in each block. Generally I think the description of the > algorithm could be more straightforward, I find it a little difficult to > understand. > # In the examples, the block sizes are not multiples of 128, but the > specification requires that. Either the examples should be replaced with > valid ones or it should be noted that this is to keep the examples shorter. > Also, it would be useful to include examples with multiple blocks. > # In the ‘Characteristics’ section, miniblock is written in two words, while > in the rest of the specification, it is written as one. -- This message was sent by Atlassian Jira (v8.3.2#803003)