[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

2018-03-19 Thread wgtmac
Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/233
  
Thanks @majetideepak for the comment!

On the Java side, the input timestamp in the writer's TimestampColumnVector is in 
UTC. It leverages java.sql.Timestamp, which knows the local timezone info so 
that it can print in the local timezone; you can print the millis variable at line 109 
of TimestampTreeWriter.java to verify this. The name of 
SerializationUtils.convertToUtc(localTimezone, millis) at line 113 is somewhat 
confusing, because the result is not the timestamp in UTC; it actually adds the 
local timezone offset, which I think is also a problem.
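
To illustrate (a standalone sketch, not the TimestampTreeWriter code; the sample value is made up): java.sql.Timestamp holds UTC epoch millis but prints them in the local timezone, and adding a zone offset to those millis shifts the stored instant rather than producing "the value in UTC":

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimestampPrintDemo {
    public static void main(String[] args) {
        long utcMillis = 1521417600000L;               // an arbitrary UTC epoch value
        System.out.println(new Timestamp(utcMillis));  // prints in the LOCAL timezone

        // Adding the local offset does not give "the timestamp in UTC";
        // it shifts the underlying instant by that offset.
        long shifted = utcMillis + TimeZone.getDefault().getOffset(utcMillis);
        System.out.println(new Timestamp(shifted));
    }
}
```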

ORC-10 fixed the bug without using the writer timezone. The original design was to 
be resilient to moves between different reader timezones. However, this caused an 
issue in C++ between timezones with different daylight saving rules, so the writer 
timezone is now forced to be written. The GMT offset that ORC-10 adds actually 
converts the value to the local timezone so that ColumnPrinter can print the same 
time in the local timezone. This causes a new problem: the C++ reader gets the 
timestamp value in the local timezone, not UTC, which differs from the Java reader. 
I believe this is why @owen created [ORC-37](https://issues.apache.org/jira/browse/ORC-37). 
The SQL type TimestampTz is a separate type from the traditional SQL type Timestamp; 
I don't think it is a good idea to mix the ORC timestamp type with TimestampTz, and 
there is another open issue for it: 
[ORC-189](https://issues.apache.org/jira/browse/ORC-189)

It is very confusing that an input timestamp written using the Java writer is 
read back differently by the C++ reader. I think we need to fix this, and doing so 
can also resolve ORC-37. What do you think?


---


Re: ORC double encoding optimization proposal

2018-03-19 Thread Gopal Vijayaraghavan
> existing work [1] from Teddy Choi and Owen O'Malley with some new compression 
> codecs (e.g. ZSTD and Brotli), we propose to promote FLIP as the default 
> encoding for the ORC double type to move this feature forward.

Since we're discussing these, I'm going to summarize my existing notes on this, 
before you conclude.

FLIP & SPLIT are the two best algorithms from different ends of the spectrum & 
they have their own strengths.

FLIP was designed with Intel C++ code in mind; while the Java implementation is 
somewhat slower today, the C++ impl should be very fast.

In an ideal world, the entire FLIP should unroll into a single instruction - 
PSHUFB (AVX512 will be able to unpack an 8x8x8 matrix; this is common on many 
platforms due to the similarity to RGBA data transforms).
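
As a rough illustration of that byte-shuffle idea (a scalar sketch, not the actual FLIP implementation; the names are made up), transposing a block of 8 doubles into 8 byte-planes produces the layout a PSHUFB-style shuffle would emit in one go:

```java
public class FlipTransposeSketch {
    // Scalar byte-transpose of 8 doubles into 8 byte-planes:
    // planes[b][i] holds byte b of value i. A vectorized version
    // would collapse the inner loop into a single shuffle.
    static byte[][] transpose8(double[] block) {   // expects block.length == 8
        byte[][] planes = new byte[8][8];
        for (int i = 0; i < 8; i++) {
            long bits = Double.doubleToRawLongBits(block[i]);
            for (int b = 0; b < 8; b++) {
                planes[b][i] = (byte) (bits >>> (b * 8));
            }
        }
        return planes;
    }
}
```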

At some point, we'll be able to rewrite it using Long8 native types (assuming 
JIT can understand a shuffle op).

http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long8.java#l16

Here are the tools to run against your own data to determine whether FLIP will work 
for you (the byteuniq UDF).

https://github.com/t3rmin4t0r/hive-dq-tools

I haven't run HEPMASS through that script, but you can see that the bit level has 
even neater entropy skews than the whole byte, though the byte packing will offer 
enough dictionary items.

https://github.com/t3rmin4t0r/zlib-zoo

shows how the LZ77 in Zlib picks its matches, mostly detecting the 7-byte 
patterns instead of the 8-byte patterns, which are definitely common 
enough (we could have much tighter symbol detection in LZ77, though I'm more 
interested in poking at the Zstd search depth now).

There are more disk savings that can come out of FLIP.

> It's compression friendly unlike Split and FPC. 

SPLIT is very memory bandwidth friendly and is probably the best format to 
cache in-memory, because it doesn't explode in size when Zlib-decompressed into 
a buffer.

SPLIT+LLAP cache is likely to be faster than FLIP+LLAP cache, purely from the 
memory bandwidth needs of the loops & the cache overflow rate of FLIP (hitting 
the cache saves the entire Zlib CPU cost, which is roughly 30%).

The core perf issue with the SPLIT algorithm is that it doesn't decompose 
neatly at the bit level in the Java memory model - the current loop has a lot 
of room for improvement.

Basically, right now there are at least 3 branches for SPLIT and 1 for FLIP - 
nextNumber() is basically assembling values 1 at a time, instead of 8 at a time.
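
As a purely hypothetical sketch of the difference (the bit layout below is illustrative, not the real SPLIT format), assembling 8 values per iteration from already-decoded long runs avoids the per-value branching that a one-at-a-time nextNumber() loop pays for:

```java
public class SplitAssembleSketch {
    // Hypothetical: assume the RLE readers have already produced batches of
    // sign/exponent words and mantissa words. The (signExp << 52 | mantissa)
    // recombination is an illustrative layout only.
    static void assembleBatch(long[] signExp, long[] mantissa, double[] out) {
        for (int i = 0; i + 8 <= out.length; i += 8) {
            for (int j = 0; j < 8; j++) {        // candidate for full 8x unrolling
                long bits = (signExp[i + j] << 52)
                          | (mantissa[i + j] & 0xFFFFFFFFFFFFFL);
                out[i + j] = Double.longBitsToDouble(bits);
            }
        }
    }
}
```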

Speaking purely from a decode-loop perspective, there are a lot of pending 
performance improvements to be made to the SPLIT algorithm - they should 
ideally come as part of RLEv3 & indirectly make the SPLIT reader faster.

With an 8x unrolled impl, SPLIT is going to catch up in total decode rate & I 
started ORC-187 after digging into some of those branching loops.

For the other two next() calls, this is the equivalent unrolling that was done 
by Prasanth with Integers.

https://www.slideshare.net/t3rmin4t0r/orc-2015/23

The Long -> Double conversion similarly has to do more register work instead of 
calling nextNumber() one value at a time. And I like SPLIT because it is very 
natural in its implementation & therefore easier to parallelize than FPC.

The current numbers aren't the final numbers by a long shot, but FLIP and SPLIT 
are the ones where I feel like more work is useful.

Cheers,
Gopal




[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

2018-03-19 Thread majetideepak
Github user majetideepak commented on the issue:

https://github.com/apache/orc/pull/233
  
@wgtmac and @stiga-huang you are right that the C++ and Java writers must 
write the same value to a file for a given input timestamp value. It looks like 
the Java side writes the timestamp values provided as-is in local time (no 
conversion) and writes the writer timezone in the footer (however, the stats are 
in UTC). We must do the same for the C++ writer as well, if it doesn't already.

ORC-10 adds the GMT offset when reading the values back. Therefore, the C++ 
reader always returns values in UTC. The current behavior of the ORC reader for 
timestamp values is the same as SQL `TimestampTz`.
To get the same values back (aka SQL `Timestamp`), you need to convert the 
values read back to local time.

If you read a timestamp column from an ORC file and plan to write it 
immediately, you must first convert the values to local time before writing.
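
For example, a minimal sketch of that conversion (plain java.util.TimeZone arithmetic, not an ORC API call; the sample value is made up) might look like:

```java
import java.util.TimeZone;

public class TimestampRoundTrip {
    // Shift a UTC-adjusted value from the reader back by the local offset
    // before handing it to a writer that expects local wall-clock time.
    static long toLocalWallClock(long utcMillis, TimeZone localZone) {
        return utcMillis + localZone.getOffset(utcMillis);
    }

    public static void main(String[] args) {
        long readBack = 1521417600000L;   // value as returned by the reader
        long forWriter = toLocalWallClock(readBack, TimeZone.getDefault());
        System.out.println(readBack + " -> " + forWriter);
    }
}
```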


---


ORC double encoding optimization proposal

2018-03-19 Thread Xu, Cheng A
Hi folks,
Based on our evaluation and analysis combining the existing work [1] from Teddy 
Choi and Owen O'Malley with some new compression codecs (e.g. ZSTD and Brotli), 
we propose to promote FLIP as the default encoding for the ORC double type to 
move this feature forward.
Currently we have five supported double encoding optimizations: plain v2, FPC_v1, 
FPC_v2, FLIP, and Split [1]. Offering various encodings can handle different usage 
scenarios, but it increases the burden on end users of making a "smart" choice, so 
it is necessary to pick a preferred default encoding type. To choose the "best" 
encoding among them, two major factors need to be considered: throughput and space 
efficiency. In real-world usage, compression is enabled by default, and throughput 
can be bottlenecked at either the compression stage or the encoding stage. As for 
space efficiency, encodings like FPC V1, FPC V2 and Split can also serve that goal 
in a way similar to compression. Our evaluation is based on a few artificial and 
non-artificial data sets. An artificial data set with low cardinality should go 
directly into dictionary encoding, so we first choose HEPMASS as the data set to 
reduce the complexity of the analysis. Now let's go through the evaluation data 
for these encodings one by one:

* Split
Benefiting from the underlying run length encoding, Split can compress the original 
data to some extent (Compression=NONE, compression ratio=47.28%). If SPLIT is chosen 
as the encoding, the compression ratio does not improve much even with a codec like 
ZLIB (44.36%); in other words, with SPLIT as the underlying encoding, compression has 
a negative impact on throughput while providing limited additional space efficiency.
In summary, Split achieves 309 MB/s in read and 82 MB/s in write with a 47.28% 
compression ratio.

| Data Set | Encoding | Compression | Read Throughput (MB/s) | Write Throughput (MB/s) | Compression Ratio |
|----------|----------|-------------|------------------------|-------------------------|-------------------|
| HEPMASS | SPLIT | NONE | 309.5525998 | 82.928411 | 47.28% |
| HEPMASS | SPLIT | LZO | 326.1146497 | 62.21142279 | 47.02% |
| HEPMASS | SPLIT | ZLIB | 223.1909329 | 32.31915222 | 44.36% |
| HEPMASS | SPLIT | SNAPPY | 340.4255319 | 82.71405647 | 46.71% |
| HEPMASS | SPLIT | ZSTD | 295.2710496 | 76.87687831 | 43.15% |
| HEPMASS | SPLIT | BROTLI | 174.1496599 | 77.06201227 | 43.12% |



* FLIP
Since FLIP itself has no compression functionality, we need to combine it with a 
compression codec to achieve compression similar to Split's. On the HEPMASS data 
set, FLIP achieves 640 MB/s in read and 247 MB/s in write with a 61.59% compression 
ratio using LZO, and 266 MB/s in read and 145 MB/s in write with a 52.39% ratio 
using ZSTD. In summary, FLIP offers a good balance of space efficiency and 
throughput, and users can choose different compression codecs for different goals 
(high compression or high throughput).

| Data Set | Encoding | Compression | Read Throughput (MB/s) | Write Throughput (MB/s) | Compression Ratio |
|----------|----------|-------------|------------------------|-------------------------|-------------------|
| HEPMASS | FLIP | NONE | 775.7575758 | 587.1559742 | 100.00% |
| HEPMASS | FLIP | LZO | 640 | 247.1042517 | 61.59% |
| HEPMASS | FLIP | ZLIB | 272.0510096 | 19.21056617 | 52.95% |
| HEPMASS | FLIP | SNAPPY | 595.3488372 | 261.4913225 | 59.49% |
| HEPMASS | FLIP | ZSTD | 266.3891779 | 145.7858797 | 52.39% |
| HEPMASS | FLIP | BROTLI | 64.84295846 | 122.3709392 | 53.11% |



* FPC V1
FPC V1 is similar to FPC V2, with only a small difference in endianness, so we use 
FPC V1 in our analysis. FPC_V1 can also serve as a form of compression on its own 
(75.76% compression ratio when compression=NONE). Like Split, adding a compression 
codec does not contribute a much higher compression ratio (66.40%-75%) while it 
bottlenecks the throughput. In summary, FPC is not compression friendly (66%-75%), 
and its throughput is close to FLIP's without a codec but worse than FLIP's once a 
compression codec is applied.

| Data Set | Encoding | Compression | Read Throughput (MB/s) | Write Throughput (MB/s) | Compression Ratio |
|----------|----------|-------------|------------------------|-------------------------|-------------------|
| HEPMASS | FPC_V1 | NONE | 469.7247706 | 324.8731025 | 75.76% |
| HEPMASS | FPC_V1 | LZO | 474.0740741 | 310.6796174 | 75.76% |
| HEPMASS | FPC_V1 | ZLIB | 189.2091648 | 20.84011761 | 66.19% |
| HEPMASS | FPC_V1 | SNAPPY | 456.3279857 | 298.7164583 | 75.51% |
| HEPMASS | FPC_V1 | ZSTD | 238.1395349 | 183.7760264 | 66.35% |
| HEPMASS | FPC_V1 | BROTLI | 53.50052247 | 165.695796 | 66.40% |



* Plain V2
To achieve a good balance of compression and throughput, we need to consider ZSTD 
or ZLIB as the compression codec. Its throughput is then in the range of 200-600 
MB/s in read and 100-1000 MB/s in write, which is still not as good as the FLIP 
encoding.

| Data Set | Encoding | Compression | Read Throughput (MB/s) | Write Throughput (MB/s) | Compression Ratio |
|----------|----------|-------------|------------------------|-------------------------|-------------------|
| HEPMASS | PLAIN_V2 | NONE | 349.2496589 | 1153.153175 | 100.00% |
| HEPMASS | PLAIN_V2 | LZO | 571.4285714 | 236.162366 | 65.07% |
| HEPMASS | PLAIN_V2 | ZLIB | 304.3995244 | 10.14544465 | 54.15% |
| HEPMASS | PLAIN_V2 | SNAPPY | 442.1416235 | 278.5636613 | 76.60% |
| HEPMASS | PLAIN_V2 | ZSTD | 242.6540284 | 144.0630303 | 56.25% |
| HEPMASS | PLAIN_V2 | BROTLI | 43.99381337 | 96.67673896 | 59.33% |


Other data set, we can observe a similar