[ https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434889#comment-16434889 ]
ASF GitHub Bot commented on ORC-161:
------------------------------------

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180957651

    --- Diff: site/_docs/encodings.md ---
    @@ -123,6 +127,41 @@ DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
                   | DATA            | No       | Unbounded base 128 varints
                   | SECONDARY       | No       | Unsigned Integer RLE v2
     
    +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
    +stream is totally removed as all decimal values use the same scale.
    +There are two difference cases: precision<=18 and precision>18.
    +
    +### Decimal Encoding for precision <= 18
    +
    +When precision is no greater than 18, decimal values can be fully
    +represented by 64-bit signed integers which are stored in DATA stream
    +and use signed integer RLE.
    +
    +Encoding      | Stream Kind     | Optional | Contents
    +:------------ | :-------------- | :------- | :-------
    +DECIMAL       | PRESENT         | Yes      | Boolean RLE
    +              | DATA            | No       | Signed Integer RLE v1
    +DECIMAL_V2    | PRESENT         | Yes      | Boolean RLE
    +              | DATA            | No       | Signed Integer RLE v2
    +
    +### Decimal Encoding for precision > 18
    +
    +When precision is greater than 18, decimal value is split into two
    +parts: a signed integer stores higher 64 bits and an unsigned integer
    +stores lower 64 bits. Therefore, a DATA stream is utilized to store
    +the higher 64-bit signed integer of decimal values and a SECONDARY
    +stream holds the lower 64-bit unsigned integer of decimal values.
    +Both streams use RLE and are not optional in this case.
    +
    +Encoding      | Stream Kind     | Optional | Contents
    +:------------ | :-------------- | :------- | :-------
    +DECIMAL       | PRESENT         | Yes      | Boolean RLE
    --- End diff --

    This is when Decimal v1 can be retired from the encodings.

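For readers following the precision > 18 case in the diff, here is a minimal Python sketch of the split it describes: a 128-bit two's-complement unscaled value is divided into a signed upper 64 bits (DATA stream) and an unsigned lower 64 bits (SECONDARY stream), then reassembled. The function names are hypothetical and this is not the ORC implementation. The RLE step applied to each stream is left out, and treating the value as 128-bit two's complement is an assumption taken from the "higher 64 bits / lower 64 bits" wording.

```python
# Hypothetical helpers illustrating the high/low split described in the diff.
# Names and the 128-bit two's-complement assumption are illustrative only.
MASK64 = (1 << 64) - 1


def split_decimal128(unscaled: int) -> tuple:
    """Split a signed 128-bit unscaled value into (signed high 64, unsigned low 64)."""
    if not -(1 << 127) <= unscaled < (1 << 127):
        raise OverflowError("value does not fit in 128 bits")
    low = unscaled & MASK64   # unsigned lower 64 bits, would go to SECONDARY
    high = unscaled >> 64     # arithmetic shift keeps the sign, would go to DATA
    return high, low


def join_decimal128(high: int, low: int) -> int:
    """Inverse of split_decimal128: rebuild the signed 128-bit unscaled value."""
    return (high << 64) | low


# Round trip for a decimal(38, 2) value stored as its unscaled integer.
unscaled = -1234567890123456789012345   # represents -12345678901234567890123.45
high, low = split_decimal128(unscaled)
restored = join_decimal128(high, low)
```

Because the sign lives entirely in the upper half, the lower half can always be treated as unsigned, which matches the signed/unsigned pairing of the two streams in the proposed table.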
> Create a new column type that run-length-encodes decimals
> ---------------------------------------------------------
>
>                 Key: ORC-161
>                 URL: https://issues.apache.org/jira/browse/ORC-161
>             Project: ORC
>          Issue Type: Wish
>          Components: encoding
>            Reporter: Douglas Drinka
>            Priority: Major
>
> I'm storing prices in ORC format, and have made the following observations
> about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or
> minus a few pennies per data point. This would encode beautifully with a
> patched base encoding. Instead I'm averaging 4 bytes per data point, after
> Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in
> decimal columns, but that you probably won't. Presto, for instance, has a
> fast-path which engages for precision of 18 or less, and decodes to 64-bit
> longs, and then a slow path which uses BigInt. I anticipate the majority of
> implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data
> point is messy. Sometimes it's checked on data ingest, other times it's an
> error on reading, or else it's cast (and rounded?)
> I don't propose eliminating the current column type. It's nice to know
> there's a way to store really big numbers (or really accurate numbers) if I
> need that in the future.
> But I'd like to see a new column that uses the existing Run Length Encoding
> functionality, and is limited to 63+1 bit numbers, with a fixed precision and
> scale for ingest and query.
> I think one could call this FixedPoint. Every number is stored as a long,
> and scaled by a column constant. Ingest from decimal would scale and throw
> or round, configurably. Precision would be fixed at 18, or made configurable
> and verified at ingest. Stats would use longs (scaled with the column)
> rather than strings.
> Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits
> of precision. Or they can keep using decimal if they need 128 bits. Win/win?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
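
As an illustration of the FixedPoint idea in the quoted issue (every number stored as a long scaled by a column constant, with ingest that either throws or rounds), here is a small Python sketch. The names `to_fixed_point`, `from_fixed_point` and `round_on_ingest` are made up for this example, and half-even rounding is an assumption; the issue does not say which rounding mode would apply.

```python
from decimal import Decimal
from fractions import Fraction

INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1


def to_fixed_point(value: Decimal, scale: int, round_on_ingest: bool = False) -> int:
    """Convert a Decimal to a 64-bit integer scaled by the column constant 10**scale.

    Hypothetical ingest rule: throw on excess fractional digits unless
    round_on_ingest is set, in which case round half to even.
    Assumes scale >= 0.
    """
    scaled = Fraction(value) * 10**scale   # exact rational arithmetic, no context rounding
    if scaled.denominator != 1:
        if not round_on_ingest:
            raise ValueError(f"{value} has more than {scale} fractional digits")
        scaled = round(scaled)             # Fraction rounds half to even
    result = int(scaled)
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError(f"{value} does not fit in 64 bits at scale {scale}")
    return result


def from_fixed_point(stored: int, scale: int) -> Decimal:
    """Recover the decimal value from the stored long and the column scale."""
    return Decimal(stored).scaleb(-scale)


# A price column at scale 2: each value is stored as a whole number of pennies.
prices = [Decimal("123.45"), Decimal("123.47"), Decimal("123.44")]
stored = [to_fixed_point(p, 2) for p in prices]
deltas = [b - a for a, b in zip(stored, stored[1:])]
```

With the scale fixed per column, consecutive prices differ by small integers, which is the kind of input the existing integer run-length encodings are designed to compress.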