[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp
Github user wgtmac commented on the issue: https://github.com/apache/orc/pull/233

Thanks @majetideepak for the comment! On the Java side, the input timestamp in the writer's TimestampColumnVector is in UTC. It leverages java.sql.Timestamp, which knows the local timezone info so that it can PRINT in the local timezone. You can print the millis variable at line 109 in TimestampTreeWriter.java to verify this. The name of SerializationUtils.convertToUtc(localTimezone, millis) at line 113 is kind of confusing, because the result is not the timestamp in UTC but the timestamp with the local timezone offset added, which I think is also a problem.

ORC-10 fixed the bug without a writer timezone. The original design was to be resilient to moves between different reader timezones. However, this caused an issue in C++ between timezones with different daylight saving rules, so the writer timezone is now forced to be written. The GMT offset that ORC-10 adds actually converts the value to the local timezone so that ColumnPrinter can print the same time in the local timezone. This causes a new problem: the C++ reader gets the timestamp value in the local timezone, not UTC, which is different from the Java reader. I believe this is why @owen created [ORC-37](https://issues.apache.org/jira/browse/ORC-37).

SQL type TimestampTz is a new type, distinct from the traditional SQL type Timestamp. I don't think it is a good idea to mix the ORC timestamp type with TimestampTz, and there is another open issue for that: [ORC-189](https://issues.apache.org/jira/browse/ORC-189).

It is very confusing that an input timestamp written by the Java writer is read back differently by the C++ reader. I think we need to fix it, and this can also resolve ORC-37. What do you think?
---
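As a minimal sketch of what "adds an offset to the local timezone" means here (the helper name below is hypothetical; the real method is SerializationUtils.convertToUtc in the ORC Java writer):

```java
import java.util.TimeZone;

// Sketch of the offset arithmetic discussed in this thread (hypothetical
// helper name, not the actual ORC code). Adding the zone offset to a UTC
// instant does NOT yield UTC: it yields the local wall-clock time
// reinterpreted as if it were UTC, which is exactly the naming confusion
// raised above.
public class TimestampOffsetSketch {
    // "convertToUtc"-style shift: UTC millis -> wall-clock-as-UTC millis.
    static long addZoneOffset(TimeZone zone, long utcMillis) {
        return utcMillis + zone.getOffset(utcMillis);
    }
}
```

For a fixed zone like GMT+02:00, the epoch instant 0 maps to 7,200,000 ms, i.e. the 02:00 wall-clock reading encoded as if it were a UTC instant.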
Re: ORC double encoding optimization proposal
> existing work [1] from Teddy Choi and Owen O'Malley with some new compression
> codec (e.g. ZSTD and Brotli), we proposed to promote FLIP as the default
> encoding for ORC double type to move this feature forwards.

Since we're discussing these, I'm going to summarize my existing notes on this before you conclude. FLIP & SPLIT are the two best algorithms from different ends of the spectrum & they have their own strengths.

FLIP was designed with Intel C++ code in mind; while the Java implementation is somewhat slower today, the C++ impl should be very fast. In an ideal world, the entire FLIP should unroll into a single instruction - PSHUFB (AVX512 will be able to unpack an 8x8x8 matrix; this is common on many platforms due to the similarity to RGBA data transforms). At some point, we'll be able to rewrite it using Long8 native types (assuming the JIT can understand a shuffle op). http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long8.java#l16

Here are the tools to run your own data through to determine if FLIP will work for you (the byteuniq UDF): https://github.com/t3rmin4t0r/hive-dq-tools

I haven't run HEPMASS through that script, but you can see the bit level has even neater entropy skews than the whole byte, though the byte packing will offer enough dictionary items. https://github.com/t3rmin4t0r/zlib-zoo shows how the LZ77 in Zlib picks matches by mostly detecting the 7-byte patterns instead of the 8-byte patterns, which are definitely common enough (we can have much tighter symbol detection in LZ77, though I'm more interested in poking at the Zstd search depth now). There are more disk savings that can come out of FLIP.

> It's compression friendly unlike Split and FPC.

SPLIT is very memory-bandwidth friendly and is probably the best format to cache in-memory, because it doesn't explode in size when Zlib-decompressed into a buffer.
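The 8x8 byte shuffle described above can be sketched in scalar Java (my reading of the description; the real ORC code and the PSHUFB form differ). Eight doubles form an 8x8 byte matrix; transposing it groups the i-th byte of every value into one plane, which is what makes the output compression friendly:

```java
// Scalar sketch (assumption, not the ORC implementation) of a FLIP-style
// byte transposition: n doubles -> 8 byte planes of width n, where
// out[b*n + i] holds byte b of value i. On x86 this maps to a
// PSHUFB-style shuffle; here it is written as plain loops.
public class FlipSketch {
    static byte[] transpose(double[] values) {
        int n = values.length;
        byte[] out = new byte[n * 8];
        for (int i = 0; i < n; i++) {
            long bits = Double.doubleToRawLongBits(values[i]);
            for (int b = 0; b < 8; b++) {
                out[b * n + i] = (byte) (bits >>> (8 * b));
            }
        }
        return out;
    }

    // Inverse shuffle: reassemble the doubles from the byte planes.
    static double[] untranspose(byte[] planes, int n) {
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            long bits = 0;
            for (int b = 0; b < 8; b++) {
                bits |= (planes[b * n + i] & 0xFFL) << (8 * b);
            }
            out[i] = Double.longBitsToDouble(bits);
        }
        return out;
    }
}
```

The transform is lossless (transpose then untranspose is the identity); the win is that sign/exponent bytes of similar-magnitude values land adjacent to each other, feeding the downstream codec longer matches.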
SPLIT+LLAP cache is likely to be faster than FLIP+LLAP cache, purely from the memory bandwidth needs of the loops & the cache overflow rate of FLIP (hitting the cache saves the entire Zlib CPU cost, which is about ~30%).

The core perf issue with the SPLIT algorithm is that it doesn't decompose neatly at the bit level in the Java memory model - the current loop has a lot of room for improvement. Basically, right now there are at least 3 branches for SPLIT and 1 for FLIP - nextNumber() is assembling 1 value at a time instead of 8 at a time. Purely from a decode-loop perspective, we have a lot of pending performance improvements to be made to the SPLIT algorithm - those should ideally come as part of RLEv3 & indirectly make the SPLIT reader faster. With an 8x unrolled impl, SPLIT is going to catch up in total decode rate & I started ORC-187 after digging into some of those branching loops. For the other two next() calls, this is the equivalent unrolling that was done by Prasanth with integers: https://www.slideshare.net/t3rmin4t0r/orc-2015/23 The Long -> Double conversion has to similarly do more register work instead of calling nextNumber() one value at a time.

And I like SPLIT because it is very natural in its implementation & therefore easier to parallelize than FPC.

The current numbers aren't the final numbers by a long shot, but FLIP and SPLIT are the ones where I feel more work is useful.

Cheers,
Gopal
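The per-value vs. 8x-unrolled decode shapes discussed in the message above can be sketched as follows (hypothetical illustration only, not the actual ORC reader loop; the real nextNumber() also carries run-length state and bounds branches):

```java
// Hypothetical sketch of the unrolling argument: both methods turn raw
// long bit patterns into doubles, but the unrolled body amortizes
// per-iteration loop overhead across 8 values, which is the shape a JIT
// can vectorize. This is NOT the ORC SPLIT reader code.
public class UnrollSketch {
    // One value per iteration, like calling nextNumber() each time.
    static void decodeOneAtATime(long[] bits, double[] out) {
        for (int i = 0; i < bits.length; i++) {
            out[i] = Double.longBitsToDouble(bits[i]);
        }
    }

    // 8x unrolled body plus a scalar tail for the remainder.
    static void decodeUnrolled8(long[] bits, double[] out) {
        int i = 0;
        int limit = bits.length & ~7;  // round down to a multiple of 8
        for (; i < limit; i += 8) {
            out[i]     = Double.longBitsToDouble(bits[i]);
            out[i + 1] = Double.longBitsToDouble(bits[i + 1]);
            out[i + 2] = Double.longBitsToDouble(bits[i + 2]);
            out[i + 3] = Double.longBitsToDouble(bits[i + 3]);
            out[i + 4] = Double.longBitsToDouble(bits[i + 4]);
            out[i + 5] = Double.longBitsToDouble(bits[i + 5]);
            out[i + 6] = Double.longBitsToDouble(bits[i + 6]);
            out[i + 7] = Double.longBitsToDouble(bits[i + 7]);
        }
        for (; i < bits.length; i++) {
            out[i] = Double.longBitsToDouble(bits[i]);
        }
    }
}
```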
[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp
Github user majetideepak commented on the issue: https://github.com/apache/orc/pull/233

@wgtmac and @stiga-huang, you are right that the C++ and Java writers must write the same value to a file for a given input timestamp value. It looks like the Java side writes the timestamp values provided as-is in local time (no conversion) and writes the writer timezone in the footer (however, stats are in UTC). We must do the same for the C++ writer, if it doesn't already.

ORC-10 adds the GMT offset when reading the values back. Therefore, the C++ reader always returns values in UTC. The current behavior of the ORC reader for timestamp values is the same as SQL `TimestampTz`. To get the same values back (i.e., SQL `Timestamp`), you need to convert the values read back to local time. If you read a timestamp column from an ORC file and plan to write it back immediately, you must first convert the values to local time before writing.
---
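The read-then-rewrite conversion described above can be sketched with hypothetical helpers (this is not ORC API code; the arithmetic is exact for fixed-offset zones, while near daylight-saving transitions the instant used for the offset lookup matters and the writer-timezone rules are needed):

```java
import java.util.TimeZone;

// Sketch (hypothetical helpers) of the round trip described in the
// comment above: the reader returns UTC-style (TimestampTz-like) values,
// so they must be converted to local time before being handed back to a
// writer that stores values as-is in local time.
public class TimestampRoundTrip {
    // UTC instant -> local wall-clock millis (what a local clock shows).
    static long utcToLocal(TimeZone zone, long utcMillis) {
        return utcMillis + zone.getOffset(utcMillis);
    }

    // Local wall-clock millis -> UTC instant. Approximate near DST
    // transitions (offset is looked up at the local value, not the
    // original instant); exact for fixed-offset zones.
    static long localToUtc(TimeZone zone, long localMillis) {
        return localMillis - zone.getOffset(localMillis);
    }
}
```

With a fixed zone such as GMT+02:00 the two helpers are exact inverses, which is the property a read-modify-write pipeline needs.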
ORC double encoding optimization proposal
Hi folks,

According to our evaluation and analysis, combining the existing work [1] from Teddy Choi and Owen O'Malley with some new compression codecs (e.g. ZSTD and Brotli), we propose to promote FLIP as the default encoding for the ORC double type to move this feature forwards.

Currently we have five kinds of supported double encoding optimizations: Plain V2, FPC_V1, FPC_V2, FLIP, and Split [1]. Supporting various encodings can handle different usage scenarios, but it increases the burden on end users to make a "smart" choice, so it is necessary to designate a preferred default encoding type. To choose the "best" encoding among them, two major factors need to be considered: throughput and space efficiency. In real-world usage, compression is enabled by default, so throughput can be bottlenecked by either the compression part or the encoding part. As for space efficiency, encodings like FPC V1, FPC V2, and Split can also serve the goal of space efficiency, similar to compression.

Our evaluation is based on a few artificial and non-artificial data sets. An artificial data set with low cardinality should go directly into dictionary encoding, so we use HEPMASS as the data set to reduce the complexity of the analysis first. Now let's go through the evaluation data for those encodings one by one:

* Split

Benefiting from the underlying run length encoding, Split can compress the original data to some extent (Compression=NONE, compression ratio=47.28%). If SPLIT is chosen as the encoding, the compression ratio does not get much better even with a codec like ZLIB (44.36%). In other words, if SPLIT is the underlying encoding, adding compression hurts throughput while yielding limited space-efficiency gains. In summary, Split achieves 309 MB/s read and 82 MB/s write with a 47.28% compression ratio.
| Data Set | Encoding | Compression | Read Throughput (MB/s) | Write Throughput (MB/s) | Compression Ratio |
|----------|----------|-------------|------------------------|-------------------------|-------------------|
| HEPMASS  | SPLIT    | NONE        | 309.5525998            | 82.928411               | 47.28%            |
| HEPMASS  | SPLIT    | LZO         | 326.1146497            | 62.21142279             | 47.02%            |
| HEPMASS  | SPLIT    | ZLIB        | 223.1909329            | 32.31915222             | 44.36%            |
| HEPMASS  | SPLIT    | SNAPPY      | 340.4255319            | 82.71405647             | 46.71%            |
| HEPMASS  | SPLIT    | ZSTD        | 295.2710496            | 76.87687831             | 43.15%            |
| HEPMASS  | SPLIT    | BROTLI      | 174.1496599            | 77.06201227             | 43.12%            |

* FLIP

Since FLIP itself has no compression functionality, we need to combine it with a compression codec to achieve compression similar to Split. On the HEPMASS data set, FLIP achieves 640 MB/s read and 247 MB/s write with a 61.59% compression ratio using LZO, and 266 MB/s read and 145 MB/s write with a 52.39% compression ratio using ZSTD. In summary, FLIP is a good balance of space efficiency and throughput, and users can choose different compression codecs depending on the goal (high compression or high throughput).

| Data Set | Encoding | Compression | Read Throughput (MB/s) | Write Throughput (MB/s) | Compression Ratio |
|----------|----------|-------------|------------------------|-------------------------|-------------------|
| HEPMASS  | FLIP     | NONE        | 775.7575758            | 587.1559742             | 100.00%           |
| HEPMASS  | FLIP     | LZO         | 640                    | 247.1042517             | 61.59%            |
| HEPMASS  | FLIP     | ZLIB        | 272.0510096            | 19.21056617             | 52.95%            |
| HEPMASS  | FLIP     | SNAPPY      | 595.3488372            | 261.4913225             | 59.49%            |
| HEPMASS  | FLIP     | ZSTD        | 266.3891779            | 145.7858797             | 52.39%            |
| HEPMASS  | FLIP     | BROTLI      | 64.84295846            | 122.3709392             | 53.11%            |

* FPC V1

FPC V1 is similar to FPC V2, differing only slightly in endianness; we choose FPC V1 for our analysis. FPC_V1 can also serve as compression on its own (75.76% compression ratio when compression=NONE). Like Split, an additional compression codec does not contribute a much higher compression ratio (66.19%-75.76%) while bottlenecking throughput. In summary, FPC is not compression friendly (66%-75%); its throughput is close to FLIP's without a codec, and worse than FLIP's once a compression codec is applied.
| Data Set | Encoding | Compression | Read Throughput (MB/s) | Write Throughput (MB/s) | Compression Ratio |
|----------|----------|-------------|------------------------|-------------------------|-------------------|
| HEPMASS  | FPC_V1   | NONE        | 469.7247706            | 324.8731025             | 75.76%            |
| HEPMASS  | FPC_V1   | LZO         | 474.0740741            | 310.6796174             | 75.76%            |
| HEPMASS  | FPC_V1   | ZLIB        | 189.2091648            | 20.84011761             | 66.19%            |
| HEPMASS  | FPC_V1   | SNAPPY      | 456.3279857            | 298.7164583             | 75.51%            |
| HEPMASS  | FPC_V1   | ZSTD        | 238.1395349            | 183.7760264             | 66.35%            |
| HEPMASS  | FPC_V1   | BROTLI      | 53.50052247            | 165.695796              | 66.40%            |

* Plain V2

To achieve a good balance of compression and throughput, we need to consider ZSTD or ZLIB as the compression codec. Its throughput is then in the range of 200~600 MB/s read and 100~1000 MB/s write, which is still not as good as the FLIP encoding.

| Data Set | Encoding | Compression | Read Throughput (MB/s) | Write Throughput (MB/s) | Compression Ratio |
|----------|----------|-------------|------------------------|-------------------------|-------------------|
| HEPMASS  | PLAIN_V2 | NONE        | 349.2496589            | 1153.153175             | 100.00%           |
| HEPMASS  | PLAIN_V2 | LZO         | 571.4285714            | 236.162366              | 65.07%            |
| HEPMASS  | PLAIN_V2 | ZLIB        | 304.3995244            | 10.14544465             | 54.15%            |
| HEPMASS  | PLAIN_V2 | SNAPPY      | 442.1416235            | 278.5636613             | 76.60%            |
| HEPMASS  | PLAIN_V2 | ZSTD        | 242.6540284            | 144.0630303             | 56.25%            |
| HEPMASS  | PLAIN_V2 | BROTLI      | 43.99381337            | 96.67673896             | 59.33%            |

On other data sets, we can observe a similar