Thanks, Gopal, for the links; they are very helpful! As Owen suggested creating a patch to the ORC specs for the proposal, I have opened this PR for discussion: https://github.com/apache/orc/pull/245
If we are all on the same page and can finalize the proposal, we can start coding afterwards. Any comments are welcome.

Thanks!
Gang

On Tue, Apr 10, 2018 at 4:22 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> Hi,
>
> I agree with your analysis about Decimals.
>
> Something similar has already gone into patch-available previously, but
> held back
>
> https://issues.apache.org/jira/browse/ORC-209
>
> This is somewhat stuck behind the Vector type system evolving support for
> this
>
> https://issues.apache.org/jira/browse/HIVE-17235
> +
> https://issues.apache.org/jira/browse/HIVE-17433
>
> That works only for Text formats in Hive right now, hope that bolsters
> your case for implementing it in ORC.
>
> Definite +1 on the DecimalStatistics impl, the String representation is
> the slowest part of the Decimal writer loop in Java (i.e., for every row,
> toString + compare min/max).
>
> Cheers,
> Gopal
>
>
> On 4/10/18, 1:24 PM, "Wu Gang" <ust...@gmail.com> wrote:
>
>     Hi,
>
>     This is Gang Wu. I proposed this in ORC-161
>     <https://issues.apache.org/jira/browse/ORC-161> but got no response,
>     so I am posting it here.
>
>     Recently I ran some benchmarks comparing ORC with our proprietary file
>     format. The results indicate that ORC does not perform well on the
>     decimal type. Based on the discussion in that JIRA, I have a proposal
>     for adding a new encoding approach for the decimal type (I don't think
>     adding another kind of decimal type is a good choice, as it may confuse
>     users). My proposal works as follows:
>
>     - As Hive already has precision and scale specified in the type, we
>       can remove the SECONDARY stream entirely; it currently stores the
>       scale of each element.
>
>     - Since a 128-bit integer is used to represent a decimal value and RLE
>       supports at most 64-bit integers, we have two cases here.
>         - If precision <= 18, then the whole decimal value can be
>           represented in a signed 64-bit integer. Therefore we only need a
>           DATA stream and use signed integer RLE to encode it.
>         - If precision > 18, then we need to use a signed 128-bit integer.
>           A solution is to use a signed 64-bit integer to hold the higher
>           64 bits and an unsigned 64-bit integer to hold the lower 64 bits
>           (the C++ version does exactly the same thing). In this way, we
>           can use the DATA stream with signed integer RLE to store the
>           higher 64 bits and the SECONDARY stream with unsigned integer
>           RLE to store the lower 64 bits.
>
>     - DecimalStatistics uses a string type to store min/max/sum. We may
>       also replace these with the combination of sint64 and uint64
>       described above to represent a 128-bit integer. This can help save
>       a lot of space.
>
>     Any thoughts?
>
>     Best,
>     Gang
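
For illustration, the 64-bit split described in the proposal above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the helper names (fits_in_int64, split_int128, join_int128) are hypothetical and do not come from the ORC Java or C++ code, the input is the unscaled integer value of the decimal (for example, 123.45 with scale 2 is 12345), and the RLE encoding of the resulting streams is not shown.

```python
from typing import Tuple

MASK64 = (1 << 64) - 1          # low 64 bits set
INT64_MAX = (1 << 63) - 1
INT128_MIN = -(1 << 127)
INT128_MAX = (1 << 127) - 1


def fits_in_int64(precision: int) -> bool:
    """True if every unscaled value with this many decimal digits fits in a
    signed 64-bit integer. The largest magnitude is 10**precision - 1."""
    return 10 ** precision - 1 <= INT64_MAX


def split_int128(unscaled: int) -> Tuple[int, int]:
    """Split a signed 128-bit value into (high, low).

    high is a signed 64-bit integer (candidate for the DATA stream),
    low is an unsigned 64-bit integer (candidate for the SECONDARY stream).
    """
    if not INT128_MIN <= unscaled <= INT128_MAX:
        raise ValueError("value does not fit in a signed 128-bit integer")
    high = unscaled >> 64       # arithmetic shift, so the sign stays in high
    low = unscaled & MASK64     # always in [0, 2**64 - 1]
    return high, low


def join_int128(high: int, low: int) -> int:
    """Inverse of split_int128."""
    return (high << 64) | low


if __name__ == "__main__":
    for precision in (18, 19, 38):
        print(precision, fits_in_int64(precision))
    value = -(10 ** 38 - 1)     # most negative unscaled value at precision 38
    high, low = split_int128(value)
    print(high, low, join_int128(high, low) == value)
```

The check in fits_in_int64 is where the precision <= 18 boundary comes from: 10**18 - 1 is below 2**63 - 1, while 10**19 - 1 is not. Keeping the sign in the high half means the low half is always non-negative, which matches the signed/unsigned RLE pairing in the proposal.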