ASF GitHub Bot commented on ORC-161:

Github user omalley commented on a diff in the pull request:

    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@ DIRECT_V2     | PRESENT         | Yes      | Boolean 
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as an signed
    +In ORC 2.0, DECIMAL_V1 and DECIMAL_V2 encodins are introduced and
    --- End diff --
    In ORCv2, we'll just pick one RLE and not leave the choice open.
    In terms of the encoding names, I'm a bit torn. My original inclination 
would be to use DECIMAL64 and DECIMAL128 as encoding names. However, it would 
be nice to have the ability to use dictionaries, so we'd need dictionary forms 
of them too. Thoughts?
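
For context, the existing scheme the diff describes (the unscaled integer written as a zigzag-encoded base 128 varint) can be sketched roughly as below. This is only an illustration, not ORC's actual reader or writer code: the function names are made up for the example, and it assumes the usual least-significant-group-first varint byte order.

```python
# Illustrative sketch only; function names are hypothetical, not ORC APIs.
# Assumes varints are written least-significant 7-bit group first.

def zigzag_encode(n: int) -> int:
    """Map signed to unsigned: 0->0, -1->1, 1->2, -2->3, ..."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)


def varint_encode(u: int) -> bytes:
    """Write a non-negative integer as 7-bit groups; high bit means 'more'."""
    out = bytearray()
    while True:
        group = u & 0x7F
        u >>= 7
        if u:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)         # last group
            return bytes(out)


def varint_decode(data: bytes) -> int:
    """Read one varint from the start of data."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            return result
    raise ValueError("truncated varint")


def encode_unscaled(unscaled: int) -> bytes:
    """Encode the integer form of a decimal, e.g. 123.45 at scale 2 -> 12345."""
    return varint_encode(zigzag_encode(unscaled))


def decode_unscaled(data: bytes) -> int:
    return zigzag_decode(varint_decode(data))


# The largest 38-digit value fits in 127 bits, as the documentation text says.
assert 10**38 - 1 < 2**127
```

Because Python integers are unbounded, the same functions cover both the 38-digit case and the original "infinite precision" case; a fixed-width DECIMAL64-style encoding would not need the varint step at all.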

> Create a new column type that run-length-encodes decimals
> ---------------------------------------------------------
>                 Key: ORC-161
>                 URL: https://issues.apache.org/jira/browse/ORC-161
>             Project: ORC
>          Issue Type: Wish
>          Components: encoding
>            Reporter: Douglas Drinka
>            Priority: Major
> I'm storing prices in ORC format, and have made the following observations 
> about the current decimal implementation:
> - The encoding is inefficient: my prices follow a random walk, plus or 
> minus a few pennies per data point. This would encode beautifully with a 
> patched base encoding.  Instead I'm averaging 4 bytes per data point, after 
> Zlib.
> - Everyone acknowledges that it's nice to be able to store huge numbers in 
> decimal columns, but that you probably won't.  Presto, for instance, has a 
> fast-path which engages for precision of 18 or less, and decodes to 64-bit 
> longs, and then a slow path which uses BigInt.  I anticipate the majority of 
> implementations fit the decimal(18,6) use case.
> - The whole concept of precision/scale, along with a dedicated scale per data 
> point is messy.  Sometimes it's checked on data ingest, other times it's an 
> error on reading, or else it's cast (and rounded?).
> I don't propose eliminating the current column type.  It's nice to know 
> there's a way to store really big numbers (or really accurate numbers) if I 
> need that in the future.
> But I'd like to see a new column type that uses the existing Run Length Encoding 
> functionality, and is limited to 63+1 bit numbers, with a fixed precision and 
> scale for ingest and query.
> I think one could call this FixedPoint.  Every number is stored as a long, 
> and scaled by a column constant.  Ingest from decimal would scale and throw 
> or round, configurably.  Precision would be fixed at 18, or made configurable 
> and verified at ingest.  Stats would use longs (scaled with the column) 
> rather than strings.
> Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits 
> of precision.  Or they can keep using decimal if they need 128 bits.  Win/win?
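
A minimal sketch of the FixedPoint idea described above, assuming a per-column scale and a flag that chooses between throwing and rounding at ingest. The names `to_fixed_point`, `from_fixed_point`, and `round_excess` are hypothetical, and round-half-up on the magnitude is just one possible rounding choice; nothing here is an existing ORC API.

```python
from decimal import Decimal

# Range of a signed 64-bit long (the "63+1 bit" numbers in the proposal).
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1


def to_fixed_point(value: Decimal, scale: int, round_excess: bool = False) -> int:
    """Convert a decimal to a long scaled by 10**scale (hypothetical helper).

    Digits beyond the column scale either raise or are rounded half-up on
    the magnitude, depending on round_excess.
    """
    if not value.is_finite():
        raise ValueError("NaN and infinity cannot be stored")
    sign, digits, exponent = value.as_tuple()
    coefficient = int("".join(map(str, digits)))
    shift = exponent + scale
    if shift >= 0:
        magnitude = coefficient * 10**shift
    else:
        divisor = 10 ** (-shift)
        magnitude, remainder = divmod(coefficient, divisor)
        if remainder:
            if not round_excess:
                raise ValueError(f"{value} has more than {scale} fractional digits")
            if 2 * remainder >= divisor:
                magnitude += 1
    unscaled = -magnitude if sign else magnitude
    if not LONG_MIN <= unscaled <= LONG_MAX:
        raise OverflowError(f"{value} does not fit in 64 bits at scale {scale}")
    return unscaled


def from_fixed_point(unscaled: int, scale: int) -> Decimal:
    """Rebuild the decimal from the stored long and the column scale."""
    return Decimal(unscaled).scaleb(-scale)


# Prices that move by a few pennies become longs with tiny deltas, which is
# the kind of input an integer run-length or delta encoder handles well.
prices = [Decimal("101.25"), Decimal("101.27"), Decimal("101.24")]
longs = [to_fixed_point(p, 2) for p in prices]
deltas = [b - a for a, b in zip(longs, longs[1:])]
```

Here `longs` is `[10125, 10127, 10124]` and `deltas` is `[2, -3]`. Column statistics could be kept as the same scaled longs rather than strings.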
