[jira] [Commented] (ORC-161) Create a new column type that run-length-encodes decimals

ASF GitHub Bot (JIRA) Wed, 11 Apr 2018 20:42:34 -0700

    [ 
https://issues.apache.org/jira/browse/ORC-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434889#comment-16434889
 ]


ASF GitHub Bot commented on ORC-161:
------------------------------------

Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r180957651
  
    --- Diff: site/_docs/encodings.md ---
    @@ -123,6 +127,41 @@ DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
                   | DATA            | No       | Unbounded base 128 varints
                   | SECONDARY       | No       | Unsigned Integer RLE v2
     
    +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
    +stream is totally removed as all decimal values use the same scale.
    +There are two difference cases: precision<=18 and precision>18.
    +
    +### Decimal Encoding for precision <= 18
    +
    +When precision is no greater than 18, decimal values can be fully
    +represented by 64-bit signed integers which are stored in DATA stream
    +and use signed integer RLE.
    +
    +Encoding      | Stream Kind     | Optional | Contents
    +:------------ | :-------------- | :------- | :-------
    +DECIMAL       | PRESENT         | Yes      | Boolean RLE
    +              | DATA            | No       | Signed Integer RLE v1
    +DECIMAL_V2    | PRESENT         | Yes      | Boolean RLE
    +              | DATA            | No       | Signed Integer RLE v2
    +
    +### Decimal Encoding for precision > 18
    +
    +When precision is greater than 18, decimal value is split into two
    +parts: a signed integer stores higher 64 bits and an unsigned integer
    +stores lower 64 bits. Therefore, a DATA stream is utilized to store
    +the higher 64-bit signed integer of decimal values and a SECONDARY
    +stream holds the lower 64-bit unsigned integer of decimal values.
    +Both streams use RLE and are not optional in this case.
    +
    +Encoding      | Stream Kind     | Optional | Contents
    +:------------ | :-------------- | :------- | :-------
    +DECIMAL       | PRESENT         | Yes      | Boolean RLE
    --- End diff --
    
    This is when Decimal v1 can be retired from the encodings.
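
For readers following the quoted diff, the precision > 18 layout can be illustrated with a small sketch. It assumes the unscaled decimal value is held as a signed 128-bit two's-complement integer, with the upper half going to the DATA stream and the lower half to the SECONDARY stream. The names `split_decimal128` and `join_decimal128` are hypothetical and are not part of the ORC codebase; the RLE step itself is not shown.

```python
from typing import Tuple

MASK64 = (1 << 64) - 1  # selects the lower 64 bits


def split_decimal128(unscaled: int) -> Tuple[int, int]:
    """Split a signed 128-bit unscaled value into (high, low).

    high is a signed 64-bit integer (DATA stream in the quoted diff),
    low is an unsigned 64-bit integer (SECONDARY stream).
    """
    if not -(1 << 127) <= unscaled < (1 << 127):
        raise OverflowError("value does not fit in 128 bits")
    low = unscaled & MASK64   # unsigned lower 64 bits
    high = unscaled >> 64     # arithmetic shift, so the sign is preserved
    return high, low


def join_decimal128(high: int, low: int) -> int:
    """Inverse of split_decimal128: rebuild the signed 128-bit value."""
    return (high << 64) | low


# Round-trip a value that needs more than 64 bits.
high, low = split_decimal128(-(10 ** 37))
restored = join_decimal128(high, low)
```

Any value with at most 18 decimal digits stays below 2^63, which is why the precision <= 18 case needs only a single signed 64-bit stream.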


> Create a new column type that run-length-encodes decimals
> ---------------------------------------------------------
>
>                 Key: ORC-161
>                 URL: https://issues.apache.org/jira/browse/ORC-161
>             Project: ORC
>          Issue Type: Wish
>          Components: encoding
>            Reporter: Douglas Drinka
>            Priority: Major
>
> I'm storing prices in ORC format, and have made the following observations 
> about the current decimal implementation:
> - The encoding is inefficient: my prices are a walking-random set, plus or 
> minus a few pennies per data point. This would encode beautifully with a 
> patched base encoding.  Instead I'm averaging 4 bytes per data point, after 
> Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in 
> decimal columns, but that you probably won't.  Presto, for instance, has a 
> fast-path which engages for precision of 18 or less, and decodes to 64-bit 
> longs, and then a slow path which uses BigInt.  I anticipate the majority of 
> implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data 
> point is messy.  Sometimes it's checked on data ingest, other times it's an 
> error on reading, or else it's cast (and rounded?)
> I don't propose eliminating the current column type.  It's nice to know 
> there's a way to store really big numbers (or really accurate numbers) if I 
> need that in the future.
> But I'd like to see a new column that uses the existing Run Length Encoding 
> functionality, and is limited to 63+1 bit numbers, with a fixed precision and 
> scale for ingest and query.
> I think one could call this FixedPoint.  Every number is stored as a long, 
> and scaled by a column constant.  Ingest from decimal would scale and throw 
> or round, configurably.  Precision would be fixed at 18, or made configurable 
> and verified at ingest.  Stats would use longs (scaled with the column) 
> rather than strings.
> Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits 
> of precision.  Or they can keep using decimal if they need 128 bits.  Win/win?
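
The FixedPoint idea in the issue description (store every value as a long scaled by a column constant, and either throw or round on ingest) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `round_excess` flag, and the choice of half-even rounding are hypothetical and do not come from the issue or from ORC.

```python
from decimal import Decimal, ROUND_HALF_EVEN, localcontext

INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1


def to_fixed_point(value, scale, round_excess=False):
    """Convert a decimal value to a long scaled by 10**scale.

    If the value has more fractional digits than `scale`, either round
    (round_excess=True) or raise ValueError (the default).
    """
    with localcontext() as ctx:
        ctx.prec = 50  # enough working precision to avoid silent rounding
        scaled = Decimal(value).scaleb(scale)
        integral = scaled.to_integral_value(rounding=ROUND_HALF_EVEN)
    if integral != scaled and not round_excess:
        raise ValueError(f"{value} has more than {scale} fractional digits")
    result = int(integral)
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError(f"{value} does not fit in 64 bits at scale {scale}")
    return result


def from_fixed_point(stored, scale):
    """Recover the decimal value from the stored long and the column scale."""
    return Decimal(stored).scaleb(-scale)


price = to_fixed_point("12.34", 2)   # stored as the long 1234
```

Because the scale is a property of the column rather than of each value, the stored longs can be fed directly to an integer run-length encoder, and column statistics can be kept as longs as well.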



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
