[jira] [Commented] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638575#comment-17638575
 ] 

ASF GitHub Bot commented on PARQUET-2215:
-

pitrou merged PR #187:
URL: https://github.com/apache/parquet-format/pull/187




> Document how DELTA_BINARY_PACKED handles overflow for deltas
> 
>
> Key: PARQUET-2215
> URL: https://issues.apache.org/jira/browse/PARQUET-2215
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format
> Reporter: Rok Mihevc
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: docs
>
> [Current docs|https://github.com/apache/parquet-format/blob/master/Encodings.md?plain=1#L160] do not explicitly state how overflow is handled.
> [See discussion|https://github.com/apache/arrow/pull/14191#discussion_r1028298973] for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637687#comment-17637687
 ] 

ASF GitHub Bot commented on PARQUET-2215:
-

rok commented on code in PR #187:
URL: https://github.com/apache/parquet-format/pull/187#discussion_r1030268086


##
Encodings.md:
##
@@ -153,52 +153,88 @@ repetition and definition levels.
 ### Delta Encoding (DELTA_BINARY_PACKED = 5)
 Supported Types: INT32, INT64
 
-This encoding is adapted from the Binary packing described in ["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) by D. Lemire and L. Boytsov.
+This encoding is adapted from the Binary packing described in
+["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf)
+by D. Lemire and L. Boytsov.
 
-In delta encoding we make use of variable length integers for storing various numbers (not the deltas themselves). For unsigned values, we use ULEB128, which is the unsigned version of LEB128 (https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128). For signed values, we use zigzag encoding (https://developers.google.com/protocol-buffers/docs/encoding#signed-integers) to map negative values to positive ones and apply ULEB128 on the result.
+In delta encoding we make use of variable length integers for storing various
+numbers (not the deltas themselves). For unsigned values, we use ULEB128,
+which is the unsigned version of LEB128 (https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128).
+For signed values, we use zigzag encoding (https://developers.google.com/protocol-buffers/docs/encoding#signed-integers)
+to map negative values to positive ones and apply ULEB128 on the result.
 
-Delta encoding consists of a header followed by blocks of delta encoded values binary packed. Each block is made of miniblocks, each of them binary packed with its own bit width.
+Delta encoding consists of a header followed by blocks of delta encoded values
+binary packed. Each block is made of miniblocks, each of them binary packed with its own bit width.
 
 The header is defined as follows:
 ```
 <block size in values> <number of miniblocks in a block> <total value count> <first value>
 ```
  * the block size is a multiple of 128; it is stored as a ULEB128 int
- * the miniblock count per block is a divisor of the block size such that their quotient, the number of values in a miniblock, is a multiple of 32; it is stored as a ULEB128 int
+ * the miniblock count per block is a divisor of the block size such that their
+   quotient, the number of values in a miniblock, is a multiple of 32; it is
+   stored as a ULEB128 int
  * the total value count is stored as a ULEB128 int
  * the first value is stored as a zigzag ULEB128 int
 
 Each block contains
 ```
 <min delta> <list of bitwidths of miniblocks> <miniblocks>
 ```
- * the min delta is a zigzag ULEB128 int (we compute a minimum as we need positive integers for bit packing)
+ * the min delta is a zigzag ULEB128 int (we compute a minimum as we need
+   positive integers for bit packing)
  * the bitwidth of each block is stored as a byte
- * each miniblock is a list of bit packed ints according to the bit width stored at the begining of the block
+ * each miniblock is a list of bit packed ints according to the bit width
+   stored at the begining of the block
 
 To encode a block, we will:
 
-1. Compute the differences between consecutive elements. For the first element in the block, use the last element in the previous block or, in the case of the first block, use the first value of the whole sequence, stored in the header.
+1. Compute the differences between consecutive elements. For the first
+   element in the block, use the last element in the previous block or, in
+   the case of the first block, use the first value of the whole sequence,
+   stored in the header.
 
-2. Compute the frame of reference (the minimum of the deltas in the block). Subtract this min delta from all deltas in the block. This guarantees that all values are non-negative.
+2. Compute the frame of reference (the minimum of the deltas in the block).
+   Subtract this min delta from all deltas in the block. This guarantees that
+   all values are non-negative.
 
-3. Encode the frame of reference (min delta) as a zigzag ULEB128 int followed by the bit widths of the miniblocks and the delta values (minus the min delta) bit packed per miniblock.
+3. Encode the frame of reference (min delta) as a zigzag ULEB128 int followed
+   by the bit widths of the miniblocks and the delta values (minus the min
+   delta) bit-packed per miniblock.
 
-Having multiple blocks allows us to adapt to changes in the data by changing the frame of reference (the min delta) which can result in smaller values after the subtraction which, again, means we can store them with a lower bit width.
+Having multiple blocks allows us to adapt to changes in the data by changing
+the frame of reference (the min delta) which can result in smaller
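
The zigzag ULEB128 integers and encoding steps 1 to 3 quoted in the hunk above can be sketched roughly as follows in Python, assuming an INT64 column. The helper names (`zigzag`, `uleb128`, `wrap`, `encode_block_parts`) are hypothetical, and the bit packing itself is left out. Wrapping the deltas around in two's complement is an assumption about the overflow behaviour this issue asks to document; it is not text taken from the quoted hunk.

```python
MASK64 = (1 << 64) - 1  # assumption: the column type is INT64


def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def uleb128(n: int) -> bytes:
    """Encode a non-negative integer as ULEB128: 7 bits per byte, lowest group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def wrap(n: int) -> int:
    """Reduce an integer to signed 64-bit two's complement (assumed overflow rule)."""
    n &= MASK64
    return n - (1 << 64) if n >= (1 << 63) else n


def encode_block_parts(prev: int, values: list[int], miniblock_size: int = 32):
    """Return (min_delta, bit_widths, adjusted_deltas) for one block, without bit packing.

    `prev` is the last value of the previous block, or the first value of the
    whole sequence (stored in the header) when this is the first block.
    """
    # Step 1: differences between consecutive elements, wrapped on overflow.
    deltas = []
    for v in values:
        deltas.append(wrap(v - prev))
        prev = v
    # Step 2: frame of reference; after subtraction every value is non-negative.
    min_delta = min(deltas)
    adjusted = [(d - min_delta) & MASK64 for d in deltas]
    # Step 3 (partial): one bit width per miniblock, wide enough for its largest value.
    widths = [
        max(adjusted[i:i + miniblock_size]).bit_length()
        for i in range(0, len(adjusted), miniblock_size)
    ]
    return min_delta, widths, adjusted


# Sequence 7, 5, 3, 1, 2, 3, 4, 5 with the first value 7 kept in the header.
min_delta, widths, adjusted = encode_block_parts(7, [5, 3, 1, 2, 3, 4, 5])
header_first_value = uleb128(zigzag(7))   # the first value as a zigzag ULEB128 int
block_min_delta = uleb128(zigzag(min_delta))
```

With these inputs the deltas are -2, -2, -2, 1, 1, 1, 1, so the min delta is -2 and the values left for bit packing are 0, 0, 0, 3, 3, 3, 3, which fit in 2 bits. Under the wrapping assumption, going from the largest INT64 value to the smallest yields a delta of 1 rather than a value that needs 65 bits.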

[jira] [Commented] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637677#comment-17637677
 ] 

ASF GitHub Bot commented on PARQUET-2215:
-

pitrou commented on PR #187:
URL: https://github.com/apache/parquet-format/pull/187#issuecomment-1324796485

   @ksuarez1423 @wjones127 Would you like to take a look at the wording?




[jira] [Commented] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637676#comment-17637676
 ] 

ASF GitHub Bot commented on PARQUET-2215:
-

pitrou commented on code in PR #187:
URL: https://github.com/apache/parquet-format/pull/187#discussion_r1030229202



[jira] [Commented] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637668#comment-17637668
 ] 

ASF GitHub Bot commented on PARQUET-2215:
-

rok commented on code in PR #187:
URL: https://github.com/apache/parquet-format/pull/187#discussion_r1030213552



[jira] [Commented] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637665#comment-17637665
 ] 

ASF GitHub Bot commented on PARQUET-2215:
-

pitrou commented on code in PR #187:
URL: https://github.com/apache/parquet-format/pull/187#discussion_r1030207739



[jira] [Commented] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637654#comment-17637654
 ] 

ASF GitHub Bot commented on PARQUET-2215:
-

rok commented on code in PR #187:
URL: https://github.com/apache/parquet-format/pull/187#discussion_r1030180263



[jira] [Commented] (PARQUET-2215) Document how DELTA_BINARY_PACKED handles overflow for deltas

2022-11-23 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637636#comment-17637636
 ] 

ASF GitHub Bot commented on PARQUET-2215:
-

pitrou commented on PR #187:
URL: https://github.com/apache/parquet-format/pull/187#issuecomment-1324708857

   cc @rok




