Douglas Drinka created ORC-161:
----------------------------------
Summary: Create a new column type that run-length-encodes decimals
Key: ORC-161
URL: https://issues.apache.org/jira/browse/ORC-161
Project: ORC
Issue Type: Wish
Components: encoding
Reporter: Douglas Drinka
I'm storing prices in ORC format, and have made the following observations
about the current decimal implementation:
- The encoding is inefficient: my prices are a walking-random set, plus or
minus a few pennies per data point. This would encode beautifully with a
patched base encoding. Instead I'm averaging 4 bytes per data point, after
Zlib.
- Everyone acknowledges that it's nice to be able to store huge numbers in
decimal columns, but that you probably won't. Presto, for instance, has a
fast-path which engages for precision of 18 or less, and decodes to 64-bit
longs, and then a slow path which uses BigInt. I anticipate the majority of
implementations fit the decimal(18,6) use case.
- The whole concept of precision/scale, along with a dedicated scale per data
point is messy. Sometimes it's checked on data ingest, other times its an
error on reading, or else it's cast (and rounded?)
I don't propose eliminating the current column type. It's nice to know there's
a way to store really big numbers (or really accurate numbers) if I need that
in the future.
But I'd like to see a new column that uses the existing Run Length Encoding
functionality, and is limited to 63+1 bit numbers, with a fixed precision and
scale for ingest and query.
I think one could call this FixedPoint. Every number is stored as a long, and
scaled by a column constant. Ingest from decimal would scale and throw or
round, configurably. Precision would be fixed at 18, or made configurable and
verified at ingest. Stats would use longs (scaled with the column) rather than
strings.
Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits of
precision. Or they can keep using decimal if they need 128 bits. Win/win?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)