Re: Orc v2 Ideas

Gopal Vijayaraghavan Tue, 09 Oct 2018 13:04:59 -0700


>    Zstd with particular settings doesn’t work well on one particular 
> non-public dataset after it is encoded by RLE. 
>    I’ve suggested that you try tuning the zstd compression to find a set of 
> parameters that work well with RLE. Take a look at how we tune the zlib 
> compression based on the type of the stream and column.


We've had a very similar discussion for Zlib when comparing it against 
SNAPPY before - we don't use the same Zlib variant for all columns.
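
To make the "different Zlib variant per column" point concrete, here is a minimal sketch using Python's standard zlib module. The stream kinds and the tuning table are hypothetical and chosen only for illustration; they are not the table ORC actually uses.

```python
import zlib

# Hypothetical tuning table: stream kind -> (level, strategy).
# Illustrative values only, not ORC's real per-stream settings.
STREAM_TUNING = {
    "DATA_INT": (zlib.Z_BEST_SPEED, zlib.Z_FILTERED),
    "DATA_TEXT": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
    "DICTIONARY": (zlib.Z_BEST_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
}

DEFAULT_TUNING = (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY)


def compress_stream(kind, payload):
    """Compress one stream with the level/strategy chosen for its kind."""
    level, strategy = STREAM_TUNING.get(kind, DEFAULT_TUNING)
    comp = zlib.compressobj(level=level, strategy=strategy)
    return comp.compress(payload) + comp.flush()


def decompress_stream(blob):
    """The reader never needs to know which level or strategy was used."""
    return zlib.decompress(blob)
```

The writer picks the settings per stream, while the reader uses one decompression path for all of them.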

Zstd has similar variants which are well suited to different streams of data - 
for example, the btlazy2 strategy.

Decompression performance was the biggest concern that came up in those 
discussions, so there is a 2-flag combo (encoding.strategy and 
compression.strategy).

Both are set to SPEED right now, because that's what most people want out of 
ORC data - but if the goals are different, then those flags should translate 
into Zstd strategies (the strategies don't need to be recorded in the binaries, 
unlike dictionaries).
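
As a rough sketch of how such a flag could translate into a Zstd strategy: the enum values below mirror ZSTD_strategy in zstd.h as of the 1.3.x releases (worth checking against the zstd version in use), while the function name, the stream kinds and the mapping itself are hypothetical.

```python
from enum import IntEnum


class ZstdStrategy(IntEnum):
    # Mirrors the ZSTD_strategy enum in zstd.h (1.3.x); verify for your version.
    FAST = 1
    DFAST = 2
    GREEDY = 3
    LAZY = 4
    LAZY2 = 5
    BTLAZY2 = 6
    BTOPT = 7
    BTULTRA = 8


def pick_zstd_strategy(compression_strategy, stream_kind):
    """Hypothetical mapping from a SPEED/COMPRESSION flag to a Zstd strategy."""
    if compression_strategy == "SPEED":
        # Cheapest match finder for every stream.
        return ZstdStrategy.FAST
    if compression_strategy == "COMPRESSION":
        # Assumption for illustration: spend more effort on RLE-encoded
        # integer streams than on everything else.
        if stream_kind == "DATA_INT":
            return ZstdStrategy.BTLAZY2
        return ZstdStrategy.LAZY2
    raise ValueError("unknown compression strategy: %r" % (compression_strategy,))
```

The chosen strategy only affects the writer, so nothing about it has to be stored in the file.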

An efficient literal representation for Zstd is definitely something to 
consider - I haven't dug into it because I'm currently missing a tool like 
"infgen" for Zstd to walk through the hex.

Cheers,
Gopal
