[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/247#discussion_r181239251

    --- Diff: site/specification/ORCv2.md ---
    @@ -0,0 +1,1032 @@
    +---
    +layout: page
    +title: Evolving Draft for ORC Specification v2
    +---
    +
    +This specification is rapidly evolving and should only be used for
    +developers on the project.
    +
    +# TO DO items
    +
    +The list of things that we plan to change:
    +
    +* Create a decimal representation with fixed scale using rle.
    +* Create a better float/double encoding that splits mantissa and
    +  exponent.
    +* Create a dictionary encoding for float, double, and decimal.
    +* Create RLEv3:
    +  * 64 and 128 bit variants
    +  * Zero suppression
    +  * Evaluate the rle subformats
    +* Group stripe data into stripelets to enable Async IO for reads.
    +* Reorder stripe data into (stripe metadata, index, dictionary, data)
    +* Stop sorting dictionaries and record the sort order separately in the index.
    +* Remove use of RLEv1 and RLEv2.
    +* Remove non-utf8 bloom filter.
    +* Use numeric value for decimal bloom filter.
    --- End diff --

    We may also use numeric value for decimal column statistics

---
[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...
GitHub user omalley opened a pull request:

    https://github.com/apache/orc/pull/247

    ORC-339. Reorganize the ORC file format specification.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/omalley/orc orc-339

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #247

commit 5c56d74d948a73f5c456e0e80ff0622505d6c1cf
Author: Owen O'Malley
Date: 2018-04-12T22:03:00Z

    ORC-339. Reorganize the ORC file format specification.

---
[jira] [Created] (ORC-339) Reorganize ORC specification
Owen O'Malley created ORC-339:

    Summary: Reorganize ORC specification
    Key: ORC-339
    URL: https://issues.apache.org/jira/browse/ORC-339
    Project: ORC
    Issue Type: Improvement
    Reporter: Owen O'Malley
    Assignee: Owen O'Malley

Currently we've put the ORC format specification in the documentation.
Now that we are starting the work to design ORCv2, it will be more
convenient to have each file format version as a separate page.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] orc issue #244: Add documentation for C++
Github user wgtmac commented on the issue: https://github.com/apache/orc/pull/244 I just committed this. Thanks @majetideepak for review! ---
[GitHub] orc pull request #244: Add documentation for C++
Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/244 ---
[GitHub] orc issue #245: ORC-161: Proposal for new decimal encodings and statistics.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/245

    Will provide them after a comprehensive benchmark.

---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181203617

    --- Diff: site/_docs/encodings.md ---
    @@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
      | DATA| No | Unbounded base 128 varints
      | SECONDARY | No | Unsigned Integer RLE v2
     
    +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
    +stream is totally removed as all decimal values use the same scale.
    +There are two difference cases: precision<=18 and precision>18.
    +
    +### Decimal Encoding for precision <= 18
    +
    +When precision is no greater than 18, decimal values can be fully
    +represented by 64-bit signed integers which are stored in DATA stream
    +and use signed integer RLE.
    +
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA| No | Signed Integer RLE v1
    +DECIMAL_V2| PRESENT | Yes | Boolean RLE
    + | DATA| No | Signed Integer RLE v2
    --- End diff --

    Part of this discussion is about the new ORC format, where
    compatibility with existing readers is not a requirement until we
    switch to the new format as the default.

---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user t3rmin4t0r commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181202668

    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +
    +DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as an signed
    +integer.
    +
    +In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
    +stream as all decimal values use the same scale. When precision is
    +no greater than 18, decimal values can be fully represented by DATA
    +stream which stores 64-bit signed integers. When precision is greater
    +than 18, we use a 128-bit signed integer to store the decimal value.
    +DATA stream stores the higher 64 bits and SECONDARY stream holds the
    +lower 64 bits. Both streams use signed integer RLE v2.
    --- End diff --

    The multiple-stream + row-group stride problems for IO were
    discussed by Owen. The disk layout is what matters for IO, not the
    logical stream separation.

---
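For readers following the quoted diff: the existing DIRECT and DIRECT_V2 decimal encodings store each value as an "unbounded length zigzag encoded base 128 varint". The sketch below illustrates that idea in Python for brevity. The function names are hypothetical and are not taken from the ORC code base; it is meant as an illustration under those assumptions, not as the project's implementation.

```python
def zigzag_encode(value: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return value * 2 if value >= 0 else -value * 2 - 1


def zigzag_decode(encoded: int) -> int:
    """Inverse of zigzag_encode."""
    return encoded // 2 if encoded % 2 == 0 else -(encoded + 1) // 2


def write_varint(value: int) -> bytes:
    """Zigzag the value, then emit 7 bits per byte, least significant group first.

    The high bit of each byte is set when more bytes follow. Python integers
    are arbitrary precision, so the length is unbounded as in the quoted text.
    """
    encoded = zigzag_encode(value)
    out = bytearray()
    while True:
        low7 = encoded & 0x7F
        encoded >>= 7
        if encoded:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def read_varint(data: bytes) -> int:
    """Decode one zigzag varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return zigzag_decode(result)
```

Because every value carries its own length, a reader has to walk the bytes one at a time, which is part of why the thread is looking at fixed-width integer representations instead.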
[GitHub] orc issue #245: ORC-161: Proposal for new decimal encodings and statistics.
Github user prasanthj commented on the issue:

    https://github.com/apache/orc/pull/245

    "we found RLEv1 + zstd may be the best combination than others in
    terms of both compression ratio and encoding/decoding speed."

    Do you have experimental numbers for this?

---
[GitHub] orc issue #245: ORC-161: Proposal for new decimal encodings and statistics.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/245

    On second thought, I added back DECIMAL_V1 to support RLE v1 in
    decimal encoding. The reason is that in our testing, we found that
    RLEv1 + zstd may be a better combination than the others in terms
    of both compression ratio and encoding/decoding speed.

---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181168352

    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +
    +DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as an signed
    +integer.
    +
    +In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
    +stream as all decimal values use the same scale. When precision is
    +no greater than 18, decimal values can be fully represented by DATA
    +stream which stores 64-bit signed integers. When precision is greater
    +than 18, we use a 128-bit signed integer to store the decimal value.
    +DATA stream stores the higher 64 bits and SECONDARY stream holds the
    +lower 64 bits. Both streams use signed integer RLE v2.
    --- End diff --

    The main problem is that we don't have 128-bit integer RLE on hand.

---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user dain commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181164570

    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +
    +DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as an signed
    +integer.
    +
    +In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
    +stream as all decimal values use the same scale. When precision is
    +no greater than 18, decimal values can be fully represented by DATA
    +stream which stores 64-bit signed integers. When precision is greater
    +than 18, we use a 128-bit signed integer to store the decimal value.
    +DATA stream stores the higher 64 bits and SECONDARY stream holds the
    +lower 64 bits. Both streams use signed integer RLE v2.
    --- End diff --

    Why split the data across two streams? This means 2 IOs (or one
    large coalesced IO) to read the values (assuming no nulls).
    Instead, can't we put all 128 bits in one stream?

---
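To make the proposal under discussion concrete, here is a minimal Python sketch of splitting a 128-bit signed value into a signed high half (for the DATA stream) and a low half (for the SECONDARY stream), and joining them again. The helper names are hypothetical. Treating the low half as an unsigned 64-bit pattern is an assumption made for illustration; `as_signed64` shows the reinterpretation that would be needed if that pattern is handed to a signed integer RLE, as the quoted diff suggests.

```python
MASK64 = (1 << 64) - 1


def split_int128(value: int) -> tuple[int, int]:
    """Return (high, low): high is a signed 64-bit value, low an unsigned 64-bit pattern."""
    if not -(1 << 127) <= value < (1 << 127):
        raise ValueError("value does not fit in a signed 128-bit integer")
    low = value & MASK64   # lower 64 bits as an unsigned pattern
    high = value >> 64     # arithmetic shift, so the sign stays in the high half
    return high, low


def join_int128(high: int, low: int) -> int:
    """Inverse of split_int128."""
    return (high << 64) | low


def as_signed64(bits: int) -> int:
    """Reinterpret an unsigned 64-bit pattern as a two's complement signed value."""
    return bits - (1 << 64) if bits >= (1 << 63) else bits
```

For small values the high half is all zeros or all ones (0 or -1), so a run-length encoder would compress it well; the cost raised in the comment above is the second stream read, not the encoded size.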
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181158484

    --- Diff: site/_docs/file-tail.md ---
    @@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
     }
     ```
     
    -For decimals, the minimum, maximum, and sum are stored.
    +For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
    +string representation is deprecated and DecimalStatistics uses integers
    +which have better performance.
     
     ```message DecimalStatistics {
      optional string minimum = 1;
      optional string maximum = 2;
      optional string sum = 3;
    +  message Int128 {
    +    repeated sint64 highBits = 1;
    +    repeated uint64 lowBits = 2;
    --- End diff --

    Here I was aligning with C++ orc::Int128's implementation to avoid
    many casts.

---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181157700

    --- Diff: site/_docs/encodings.md ---
    @@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
      | DATA| No | Unbounded base 128 varints
      | SECONDARY | No | Unsigned Integer RLE v2
     
    +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
    +stream is totally removed as all decimal values use the same scale.
    +There are two difference cases: precision<=18 and precision>18.
    +
    +### Decimal Encoding for precision <= 18
    +
    +When precision is no greater than 18, decimal values can be fully
    +represented by 64-bit signed integers which are stored in DATA stream
    +and use signed integer RLE.
    +
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA| No | Signed Integer RLE v1
    +DECIMAL_V2| PRESENT | Yes | Boolean RLE
    + | DATA| No | Signed Integer RLE v2
    --- End diff --

    @majetideepak We are already working on it and are testing and
    benchmarking it. We will contribute it back, but it may not be
    that soon.

---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user majetideepak commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181155751

    --- Diff: site/_docs/encodings.md ---
    @@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
      | DATA| No | Unbounded base 128 varints
      | SECONDARY | No | Unsigned Integer RLE v2
     
    +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
    +stream is totally removed as all decimal values use the same scale.
    +There are two difference cases: precision<=18 and precision>18.
    +
    +### Decimal Encoding for precision <= 18
    +
    +When precision is no greater than 18, decimal values can be fully
    +represented by 64-bit signed integers which are stored in DATA stream
    +and use signed integer RLE.
    +
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA| No | Signed Integer RLE v1
    +DECIMAL_V2| PRESENT | Yes | Boolean RLE
    + | DATA| No | Signed Integer RLE v2
    --- End diff --

    @xndai Vertica is interested in getting RLE v2 for C++ as well. Do
    you think we can collaborate on getting this in quickly?

---
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user xndai commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181149073

    --- Diff: site/_docs/encodings.md ---
    @@ -123,6 +127,41 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
      | DATA| No | Unbounded base 128 varints
      | SECONDARY | No | Unsigned Integer RLE v2
     
    +In ORC 2.0, DECIMAL and DECIMAL_V2 encodings are introduced and scale
    +stream is totally removed as all decimal values use the same scale.
    +There are two difference cases: precision<=18 and precision>18.
    +
    +### Decimal Encoding for precision <= 18
    +
    +When precision is no greater than 18, decimal values can be fully
    +represented by 64-bit signed integers which are stored in DATA stream
    +and use signed integer RLE.
    +
    +Encoding | Stream Kind | Optional | Contents
    +: | :-- | :--- | :---
    +DECIMAL | PRESENT | Yes | Boolean RLE
    + | DATA| No | Signed Integer RLE v1
    +DECIMAL_V2| PRESENT | Yes | Boolean RLE
    + | DATA| No | Signed Integer RLE v2
    --- End diff --

    I think we should keep RLE v1 as an option. The C++ writer
    currently does not support RLE v2 (we are working on it). We don't
    want the new decimal writer to have a dependency on that.

---
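The precision thresholds in the quoted diff follow from simple arithmetic: the largest unscaled value with 18 digits is 10^18 - 1, which is below 2^63 - 1, while 19 digits can overflow a signed 64-bit integer; likewise 38 digits stay below 2^127. The Python sketch below checks those bounds and shows how a decimal with a fixed, column-wide scale maps to an unscaled integer. The helper names are hypothetical and the conversion is an illustration, not the ORC writer's code.

```python
from decimal import Decimal

INT64_MAX = (1 << 63) - 1
INT128_MAX = (1 << 127) - 1


def max_unscaled(precision: int) -> int:
    """Largest unscaled integer that has the given number of decimal digits."""
    return 10 ** precision - 1


def to_unscaled(value: Decimal, scale: int) -> int:
    """Convert a decimal to its unscaled integer for a fixed, column-wide scale."""
    scaled = value.scaleb(scale)  # multiply by 10**scale
    if scaled != scaled.to_integral_value():
        raise ValueError("value has more fractional digits than the scale allows")
    return int(scaled)


def from_unscaled(unscaled: int, scale: int) -> Decimal:
    """Inverse of to_unscaled."""
    return Decimal(unscaled).scaleb(-scale)
```

With a scale of 2, for example, `to_unscaled(Decimal("123.45"), 2)` yields the integer 12345, which is the kind of value the proposed DATA stream would hold; no per-value scale needs to be stored.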
[GitHub] orc pull request #246: ORC-338. Workaround C++ compiler bug in xcode 9.3 by ...
Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/246 ---
[GitHub] orc pull request #227: ORC-318. Change KeyProvider API to separate createLoc...
Github user asfgit closed the pull request at: https://github.com/apache/orc/pull/227 ---
[GitHub] orc pull request #246: ORC-338. Workaround C++ compiler bug in xcode 9.3 by ...
GitHub user omalley opened a pull request:

    https://github.com/apache/orc/pull/246

    ORC-338. Workaround C++ compiler bug in xcode 9.3 by removing an inline function.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/omalley/orc orc-338

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #246

---
[jira] [Created] (ORC-338) Workaround C++ compiler bug in newest clang including xcode 9.3
Owen O'Malley created ORC-338:

    Summary: Workaround C++ compiler bug in newest clang including xcode 9.3
    Key: ORC-338
    URL: https://issues.apache.org/jira/browse/ORC-338
    Project: ORC
    Issue Type: Bug
    Reporter: Owen O'Malley
    Assignee: Owen O'Malley

The ColumnStatistics.intColumnStatistics test fails in Xcode 9.3 if
you use the release build, but passes in the debug build.
[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Github user majetideepak commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181096194

    --- Diff: site/_docs/file-tail.md ---
    @@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true values.
     }
     ```
     
    -For decimals, the minimum, maximum, and sum are stored.
    +For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
    +string representation is deprecated and DecimalStatistics uses integers
    +which have better performance.
     
     ```message DecimalStatistics {
      optional string minimum = 1;
      optional string maximum = 2;
      optional string sum = 3;
    +  message Int128 {
    +    repeated sint64 highBits = 1;
    +    repeated uint64 lowBits = 2;
    --- End diff --

    Shouldn't this be sint64 as well since we are using uint64 for the
    SECONDARY stream?

---