Github user omalley commented on a diff in the pull request:
https://github.com/apache/orc/pull/245#discussion_r181456234
--- Diff: site/_docs/file-tail.md ---
@@ -249,12 +249,25 @@ For booleans, the statistics include the count of
false and true values.
}
```
-For decimals, the minimum, maximum, and sum are stored.
+For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
+string representation is deprecated and DecimalStatistics uses integers
+which have better performance.
```message DecimalStatistics {
optional string minimum = 1;
optional string maximum = 2;
optional string sum = 3;
+ message Int128 {
--- End diff --
Let's pull the Int128 out of DecimalStatistics. We will likely use it other
places.
One concern with this representation is that -1 is pretty painful. You'll
get highBits = -1, lowBits = -1, which will only take 1 byte for highBits, but
9 bytes for lowBits (+ the 4 bytes of field identifiers & message length) = 14
bytes total. Another alternative is to use the zigzag encoding for the combined
128 bit value:
optional uint64 minLow = 4;
optional uint64 minHigh = 5;
p <= 18:
minLow = zigzag(min)
minHigh = 0
p > 18:
minLow = low bits of zigzag(min)
minHigh = high bits of zigzag(min)
That would have a representation of 1 byte each for minLow and minHigh + 2
bytes for field identifier = 4. If we leave the Int128 level that would add an
additional + 2 bytes.
---