ASF GitHub Bot commented on ORC-161:
Github user wgtmac commented on a diff in the pull request:
--- Diff: site/_docs/encodings.md ---
@@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean
Decimal was introduced in Hive 0.11 with infinite precision (the total
number of digits). In Hive 0.13, the definition was change to limit
the precision to a maximum of 38 digits, which conveniently uses 127
-bits plus a sign bit. The current encoding of decimal columns stores
-the integer representation of the value as an unbounded length zigzag
-encoded base 128 varint. The scale is stored in the SECONDARY stream
-as an signed integer.
+bits plus a sign bit.
+The DIRECT and DIRECT_V2 encodings of decimal columns store the integer
+representation of the value as an unbounded length zigzag encoded base
+128 varint. The scale is stored in the SECONDARY stream as a signed
+integer.
+
+In ORC 2.0, a DECIMAL encoding is introduced that removes the scale
+stream entirely, since all decimal values in a column use the same
+scale. When the precision is no greater than 18, decimal values can be
+fully represented by the DATA stream, which stores 64-bit signed
+integers. When the precision is greater than 18, a 128-bit signed
+integer is used to store the decimal value: the DATA stream stores the
+upper 64 bits and the SECONDARY stream holds the lower 64 bits. Both
+streams use signed integer RLE v2.
--- End diff ---
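To make the DIRECT/DIRECT_V2 part of the text above concrete, here is a minimal, self-contained sketch of zigzag plus base-128 varint encoding of a decimal's unscaled value. It is illustrative only; the class and method names are made up and this is not the actual ORC writer code.

```java
import java.io.ByteArrayOutputStream;
import java.math.BigDecimal;
import java.math.BigInteger;

// Illustrative sketch of the zigzag + base-128 varint idea used by the
// DIRECT/DIRECT_V2 decimal encoding; not the real ORC writer.
public final class DecimalVarintSketch {

  /** Zigzag-map a signed value onto an unsigned one: 0,-1,1,-2,2,... -> 0,1,2,3,4,... */
  static BigInteger zigzag(BigInteger n) {
    return n.signum() >= 0
        ? n.shiftLeft(1)                                     // 2n
        : n.negate().shiftLeft(1).subtract(BigInteger.ONE);  // -2n - 1
  }

  /** Emit the zigzagged value as a little-endian base-128 varint (7 bits per byte). */
  static byte[] toVarint(BigInteger zz) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    do {
      int b = zz.and(BigInteger.valueOf(0x7F)).intValue();
      zz = zz.shiftRight(7);
      out.write(zz.signum() == 0 ? b : (b | 0x80));          // high bit => more bytes follow
    } while (zz.signum() != 0);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    BigDecimal price = new BigDecimal("-123.456");
    // The unscaled digits go to the DATA stream; the scale (here 3) would be
    // written to the SECONDARY stream as a signed RLE integer.
    byte[] data = toVarint(zigzag(price.unscaledValue()));
    System.out.printf("unscaled=%s scale=%d varint bytes=%d%n",
        price.unscaledValue(), price.scale(), data.length);
  }
}
```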
The main problem is that we don't have 128-bit integer RLE on hand.
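For the proposed DECIMAL encoding, the workaround for the missing 128-bit RLE is to split each unscaled value into two signed 64-bit words, one per stream, each run through the existing signed RLE v2. A rough, hypothetical sketch of that split at the value level (not the actual reader/writer API):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Rough sketch only: for precision > 18 the 128-bit unscaled value is split into
// two signed 64-bit words, the upper half for DATA and the lower half for SECONDARY.
public final class Decimal128SplitSketch {

  /** Returns {high64, low64} for a value that fits in a signed 128-bit integer. */
  static long[] split(BigInteger unscaled) {
    long low = unscaled.longValue();                  // low 64 bits (two's complement)
    long high = unscaled.shiftRight(64).longValue();  // high 64 bits, sign-extended
    return new long[] {high, low};
  }

  /** Reassembles the original value from the two words (round-trip check). */
  static BigInteger join(long high, long low) {
    return BigInteger.valueOf(high).shiftLeft(64)
        .add(BigInteger.valueOf(low).and(new BigInteger("FFFFFFFFFFFFFFFF", 16)));
  }

  public static void main(String[] args) {
    BigDecimal v = new BigDecimal("-1234567890123456789012345.678901");
    long[] words = split(v.unscaledValue());
    System.out.printf("high=%d low=%d roundtrip ok=%b%n",
        words[0], words[1], join(words[0], words[1]).equals(v.unscaledValue()));
  }
}
```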
> Create a new column type that run-length-encodes decimals
> Key: ORC-161
> URL: https://issues.apache.org/jira/browse/ORC-161
> Project: ORC
> Issue Type: Wish
> Components: encoding
> Reporter: Douglas Drinka
> Priority: Major
> I'm storing prices in ORC format, and have made the following observations
> about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or
> minus a few pennies per data point. This would encode beautifully with a
> patched base encoding. Instead I'm averaging 4 bytes per data point, after
> compression.
> - Everyone acknowledges that it's nice to be able to store huge numbers in
> decimal columns, but that you probably won't. Presto, for instance, has a
> fast-path which engages for precision of 18 or less, and decodes to 64-bit
> longs, and then a slow path which uses BigInt. I anticipate the majority of
> implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data
> point, is messy. Sometimes it's checked on data ingest, other times it's an
> error on reading, or else it's cast (and rounded?).
> I don't propose eliminating the current column type. It's nice to know
> there's a way to store really big numbers (or really accurate numbers) if I
> need that in the future.
> But I'd like to see a new column that uses the existing Run Length Encoding
> functionality, and is limited to 63+1 bit numbers, with a fixed precision and
> scale for ingest and query.
> I think one could call this FixedPoint. Every number is stored as a long,
> and scaled by a column constant. Ingest from decimal would scale and throw
> or round, configurably. Precision would be fixed at 18, or made configurable
> and verified at ingest. Stats would use longs (scaled with the column)
> rather than strings.
> Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits
> of precision. Or they can keep using decimal if they need 128 bits. Win/win?
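As a purely hypothetical illustration of the FixedPoint idea described in the issue above: ingest scales each decimal by a per-column constant into a 64-bit long (throwing or rounding on excess digits), and the existing signed-integer RLE handles the stored longs. The class name, the fixed scale of 6, and the rounding switch below are invented for the example.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical sketch of the FixedPoint proposal: every value is a 64-bit long
// scaled by a per-column constant (here a decimal(18,6)-style scale of 6).
public final class FixedPointSketch {

  static final int COLUMN_SCALE = 6; // fixed for the whole column

  /** Scale a decimal to a long on ingest; either round or reject extra digits. */
  static long toFixedPoint(BigDecimal value, boolean roundOnIngest) {
    BigDecimal scaled = roundOnIngest
        ? value.setScale(COLUMN_SCALE, RoundingMode.HALF_UP)
        : value.setScale(COLUMN_SCALE);      // throws ArithmeticException if rounding needed
    return scaled.unscaledValue().longValueExact(); // throws if it overflows 63+1 bits
  }

  /** Recover the decimal value on read by applying the column scale. */
  static BigDecimal fromFixedPoint(long stored) {
    return BigDecimal.valueOf(stored, COLUMN_SCALE);
  }

  public static void main(String[] args) {
    long stored = toFixedPoint(new BigDecimal("19.99"), true);
    System.out.println(stored + " -> " + fromFixedPoint(stored)); // 19990000 -> 19.990000
  }
}
```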