Re: [PR] Update compact-protocol docs [skip ci] [thrift]

via GitHub Fri, 31 May 2024 02:53:19 -0700


alkis commented on code in PR #2980:
URL: https://github.com/apache/thrift/pull/2980#discussion_r1622134672



##########
doc/specs/thrift-compact-protocol.md:
##########
@@ -48,43 +48,37 @@ For background on Thrift see the [Thrift whitepaper 
(pdf)](https://thrift.apache
 
 ### Integer encoding
 
-The _compact protocol_ uses multiple encodings for ints: the _zigzag int_, and 
the _var int_.
+The _compact protocol_ uses 
[ZigZag](https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding)'ed
+varint (aka [ULEB128](https://en.wikipedia.org/wiki/LEB128)) encoding.
 
-Values of type `int32` and `int64` are first transformed to a *zigzag int*. A 
zigzag int folds positive and negative
-numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from 
the wire, this is translated to 0, -1, 1,
--2 or 2 respectively. Here are the (Scala) formulas to convert from 
int32/int64 to a zigzag int and back:
+Values of type `int8` are encoded as one byte, rest are converted to `int64`, 
Zigzag'ed then encoded as varint.
+Zigzag encoding maps signed integers to another domain, one where the sign bit 
is encoded in the least significant
+bit (LSB). For example 0 maps to 0, -1 to 1, 1 to 2, -2 to 3, etc. Hence the 
term zigzag. Mapping the sign bit to
+the LSB is important for compactness when the absolute value of the value is 
small, as ULEB encoding is more
+efficient for small values. Here are the (Scala) formulas to convert from 
`int64` to a zigzag `int64` and back:
 
 ```scala
-def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
-def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
-def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
-def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)
+def ToZigzag(n: Long): Long = (n << 1) ^ (n >> 63)
+def FromZigzag(n: Long): Long = (n >>> 1) ^ - (n & 1)
 ```
 
-The zigzag int is then encoded as a *var int*, also known as *Unsigned 
LEB128*.  Var ints take 1 to 5 bytes (int32) or 
-1 to 10 bytes (int64). The process consists in taking a Big Endian unsigned 
integer, left-padding the bit-string to 
-make it a multiple of 7 bits, splitting it into 7-bit groups, prefixing the 
most-significant 7-bit group with the 0 
-bit, prefixing the remaining 7-bit groups with the 1 bit and encoding the 
resulting bit-string in Little Endian.
+A ULEB128 is encoded 7-bits at a time, starting from the LSB. Each 7-bits are 
encoded as 8-bits with the top bit set
+if this is not the last byte, unset otherwise.
 
 For example, the integer 50399 is encoded as follows:
 
 ```
-50399 = 1100 0100 1101 1111         (Big Endian representation)
-      = 00000 1100 0100 1101 1111   (Left-padding)
-      = 0000011 0001001 1011111     (7-bit groups)
-      = 00000011 10001001 11011111  (Most-significant bit prefixes)
-      = 11011111 10001001 00000011  (Little Endian representation)

Review Comment:
   I see how the new version can be confusing.
   
   FWIW I find the existing one confusing, particularly the big-endian 
little-endian references since there is none of that really happening.
   
   I have rewritten the right side, can you take another look?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@thrift.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] Update compact-protocol docs [skip ci] [thrift]

Reply via email to