alkis commented on code in PR #2980: URL: https://github.com/apache/thrift/pull/2980#discussion_r1622134672
########## doc/specs/thrift-compact-protocol.md: ########## @@ -48,43 +48,37 @@ For background on Thrift see the [Thrift whitepaper (pdf)](https://thrift.apache ### Integer encoding -The _compact protocol_ uses multiple encodings for ints: the _zigzag int_, and the _var int_. +The _compact protocol_ uses [ZigZag](https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding)'ed +varint (aka [ULEB128](https://en.wikipedia.org/wiki/LEB128)) encoding. -Values of type `int32` and `int64` are first transformed to a *zigzag int*. A zigzag int folds positive and negative -numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1, --2 or 2 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back: +Values of type `int8` are encoded as one byte, rest are converted to `int64`, Zigzag'ed then encoded as varint. +Zigzag encoding maps signed integers to another domain, one where the sign bit is encoded in the least significant +bit (LSB). For example 0 maps to 0, -1 to 1, 1 to 2, -2 to 3, etc. Hence the term zigzag. Mapping the sign bit to +the LSB is important for compactness when the absolute value of the value is small, as ULEB encoding is more +efficient for small values. Here are the (Scala) formulas to convert from `int64` to a zigzag `int64` and back: ```scala -def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31) -def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1) -def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63) -def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1) +def ToZigzag(n: Long): Long = (n << 1) ^ (n >> 63) +def FromZigzag(n: Long): Long = (n >>> 1) ^ - (n & 1) ``` -The zigzag int is then encoded as a *var int*, also known as *Unsigned LEB128*. Var ints take 1 to 5 bytes (int32) or -1 to 10 bytes (int64). The process consists in taking a Big Endian unsigned integer, left-padding the bit-string to -make it a multiple of 7 bits, splitting it into 7-bit groups, prefixing the most-significant 7-bit group with the 0 -bit, prefixing the remaining 7-bit groups with the 1 bit and encoding the resulting bit-string in Little Endian. +A ULEB128 is encoded 7-bits at a time, starting from the LSB. Each 7-bits are encoded as 8-bits with the top bit set +if this is not the last byte, unset otherwise. For example, the integer 50399 is encoded as follows: ``` -50399 = 1100 0100 1101 1111 (Big Endian representation) - = 00000 1100 0100 1101 1111 (Left-padding) - = 0000011 0001001 1011111 (7-bit groups) - = 00000011 10001001 11011111 (Most-significant bit prefixes) - = 11011111 10001001 00000011 (Little Endian representation) Review Comment: I see how the new version can be confusing. FWIW I find the existing one confusing, particularly the big-endian little-endian references since there is none of that really happening. I have rewritten the right side, can you take another look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: notifications-unsubscr...@thrift.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org