Tadahito Kobayashi created ORC-468:
--------------------------------------

             Summary: Fix incorrect documentation for nanoseconds stream encoding
                 Key: ORC-468
                 URL: https://issues.apache.org/jira/browse/ORC-468
             Project: ORC
          Issue Type: Bug
          Components: documentation
            Reporter: Tadahito Kobayashi


According to the ORC specification, "1000 nanoseconds would be serialized as 0x0b and 
100000 would be serialized as 0x0d."
However, the actual encoding results are formatNano(1000) = 0x0a and 
formatNano(100000) = 0x0c.

How about changing the document as below?

"Because the number of nanoseconds often has a large number of trailing zeros, 
the number has trailing decimal zero digits removed and the last three bits are 
used to record how many zeros were removed {color:#FF0000}if there are at least 
2 trailing zeros{color}. Thus 1000 nanoseconds would be serialized as 
{color:#FF0000}0x0a{color} and 100000 would be serialized as 
{color:#FF0000}0x0c{color}."


Below are my test and its results, which confirm the nanosecond encodings.

 
{code:java}
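// This is ORC's serialization code in ColumnWriter.cc; ORC encodes
// nanoseconds with this function.
// https://github.com/apache/orc/blob/master/c%2B%2B/src/ColumnWriter.cc#L1669
#include <cstdint>
#include <cstdio>

static int64_t formatNano(int64_t nanos) {
  if (nanos == 0) {
    return 0;
  }
  else if (nanos % 100 != 0) {
    // Fewer than two trailing zeros: keep every digit, low three bits stay 0.
    return (nanos) << 3;
  }
  else {
    // Two or more trailing zeros: strip them, store (zeros removed - 1) in the low three bits.
    nanos /= 100;
    int64_t trailingZeros = 1;
    while (nanos % 10 == 0 && trailingZeros < 7) {
      nanos /= 10;
      trailingZeros += 1;
    }
    return (nanos) << 3 | trailingZeros;
  }
}

int main()
{
  for (int nano = 1; nano <= 1000000; nano *= 10) {
    // Cast so the argument matches the %02x conversion (formatNano returns int64_t).
    printf("formatNano(%d) = 0x%02x\n", nano, static_cast<unsigned>(formatNano(nano)));
  }
  return 0;
}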
{code}
 

The result:
{code:java}
formatNano(1) = 0x08
formatNano(10) = 0x50
formatNano(100) = 0x09
formatNano(1000) = 0x0a
formatNano(10000) = 0x0b
formatNano(100000) = 0x0c
formatNano(1000000) = 0x0d{code}
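
To double-check that 0x0a and 0x0c are the values the documentation should quote, the encoding can also be read in the other direction. The following is only a sketch under stated assumptions: parseNano and roundTrips are names made up here for illustration and are not copied from the ORC sources, and formatNano is repeated from the test above so that the block compiles on its own. It assumes the low three bits hold the number of removed zeros minus one, which is what the formatNano code above implies.

```cpp
#include <cassert>
#include <cstdint>

// Encoder, repeated from the test above so this sketch is self-contained.
int64_t formatNano(int64_t nanos) {
  if (nanos == 0) {
    return 0;
  }
  if (nanos % 100 != 0) {
    return nanos << 3;  // fewer than two trailing zeros: low three bits stay 0
  }
  nanos /= 100;
  int64_t trailingZeros = 1;
  while (nanos % 10 == 0 && trailingZeros < 7) {
    nanos /= 10;
    trailingZeros += 1;
  }
  return (nanos << 3) | trailingZeros;
}

// Hypothetical inverse of formatNano (name and shape are assumptions).
// The low three bits hold (zeros removed - 1), or 0 if nothing was removed.
int64_t parseNano(int64_t serialized) {
  int64_t zeros = serialized & 0x7;
  int64_t result = serialized >> 3;
  if (zeros != 0) {
    // A stored value of k means k + 1 zeros were stripped.
    for (int64_t i = 0; i <= zeros; ++i) {
      result *= 10;
    }
  }
  return result;
}

// Hypothetical helper: true if a value survives encode followed by decode.
bool roundTrips(int64_t nanos) {
  return parseNano(formatNano(nanos)) == nanos;
}
```

Under these assumptions, parseNano(0x0a) gives 1000 and parseNano(0x0c) gives 100000, which matches the results above, while 0x0b and 0x0d decode to 10000 and 1000000 respectively.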



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
