Tadahito Kobayashi created ORC-468:
--------------------------------------

             Summary: Fix incorrect documentation for nanoseconds stream encoding
                 Key: ORC-468
                 URL: https://issues.apache.org/jira/browse/ORC-468
             Project: ORC
          Issue Type: Bug
          Components: documentation
            Reporter: Tadahito Kobayashi

According to the ORC spec doc, "1000 nanoseconds would be serialized as 0x0b and 100000 would be serialized as 0x0d."

However, the actual encoding results are:
formatNano(1000) = 0x0a and formatNano(100000) = 0x0c.

How about changing the document as below?

"Because the number of nanoseconds often has a large number of trailing zeros, the number has trailing decimal zero digits removed and the last three bits are used to record how many zeros were removed {color:#FF0000}if the trailing zeros are more than 2{color}. Thus 1000 nanoseconds would be serialized as {color:#FF0000}0x0a{color} and 100000 would be serialized as {color:#FF0000}0x0c{color}."

Below are my test and its results, which confirm the nanosecond encodings.

{code:java}
#include <cstdint>
#include <cstdio>

// ORC's serialization code from ColumnWriter.cc; ORC encodes nanoseconds with this function.
// https://github.com/apache/orc/blob/master/c%2B%2B/src/ColumnWriter.cc#L1669
static int64_t formatNano(int64_t nanos) {
  if (nanos == 0) {
    return 0;
  } else if (nanos % 100 != 0) {
    // fewer than two trailing zeros: store the value unscaled, low three bits are 0
    return (nanos) << 3;
  } else {
    // two or more trailing zeros: strip them and store (removed zeros - 1) in the low three bits
    nanos /= 100;
    int64_t trailingZeros = 1;
    while (nanos % 10 == 0 && trailingZeros < 7) {
      nanos /= 10;
      trailingZeros += 1;
    }
    return (nanos) << 3 | trailingZeros;
  }
}

int main() {
  for (int nano = 1; nano <= 1000000; nano *= 10) {
    // cast so the argument matches the %02x conversion
    printf("formatNano(%d) = 0x%02x\n", nano, static_cast<unsigned int>(formatNano(nano)));
  }
  return 0;
}
{code}

The result:

{code:java}
formatNano(1) = 0x08
formatNano(10) = 0x50
formatNano(100) = 0x09
formatNano(1000) = 0x0a
formatNano(10000) = 0x0b
formatNano(100000) = 0x0c
formatNano(1000000) = 0x0d
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
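
For reference, the encoding shown by formatNano can be inverted as sketched below. This is only a minimal sketch derived from the formatNano code quoted in this message: the name parseNano is hypothetical and is not taken from the ORC source. It assumes that non-zero low three bits hold the number of removed zeros minus one, which is what formatNano writes.

```cpp
#include <cstdint>

// Hypothetical inverse of formatNano (the name parseNano is an assumption).
// Low three bits == 0: the value was stored unscaled.
// Low three bits == n (n > 0): n + 1 trailing decimal zeros were removed.
static int64_t parseNano(int64_t encoded) {
  int64_t zeros = encoded & 0x7;  // stored zero count (removed zeros - 1)
  int64_t nanos = encoded >> 3;   // remaining significant digits
  if (zeros != 0) {
    // multiply back the n + 1 zeros that were stripped
    for (int64_t i = 0; i <= zeros; ++i) {
      nanos *= 10;
    }
  }
  return nanos;
}
```

Under these assumptions, parseNano(0x0a) gives 1000 and parseNano(0x0c) gives 100000, which matches the test results reported in this message.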