> Zstd with particular settings doesn’t work well on one particular
> non-public dataset after it is encoded by RLE.
>
> I’ve suggested that you try tuning the zstd compression to find a set of
> parameters that work well with RLE. Take a look at how we tune the zlib
> compression based on the type of the stream and column.

We've had an almost identical discussion about Zlib when comparing it against SNAPPY: we don't use the same Zlib variant for all columns. Zstd has similar variants that are well suited to different streams of data, for example btlazy2.

Decompression performance was the biggest concern that came up in those discussions, so there is a two-flag combo (encoding.strategy and compression.strategy). Both are set to SPEED right now, because that's what most people want out of ORC data, but if the goals are different, then those flags should translate into Zstd strategies (the strategies don't need to be recorded in the binaries, unlike dictionaries).

An efficient literal representation for Zstd is definitely something to consider. I haven't dug into it because I'm currently missing a tool like "infgen" for Zstd to walk through the hex.

Cheers,
Gopal
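
P.S. A rough sketch of the per-stream tuning idea, using Python's standard zlib module as a stand-in since the same reasoning should carry over to Zstd strategies. The stream kinds, the flag values and the whole settings table are hypothetical and only for illustration; they are not what ORC actually picks. The point it tries to show is that the strategy is chosen at write time and the reader needs no knowledge of it.

```python
import zlib

# Hypothetical mapping from (stream kind, compression.strategy flag) to
# (zlib level, zlib strategy). Illustrative only, not ORC's real choices.
ZLIB_SETTINGS = {
    ("DATA", "SPEED"): (zlib.Z_BEST_SPEED, zlib.Z_DEFAULT_STRATEGY),
    ("DATA", "COMPRESSION"): (zlib.Z_BEST_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
    ("LENGTH", "SPEED"): (zlib.Z_BEST_SPEED, zlib.Z_FILTERED),
    ("LENGTH", "COMPRESSION"): (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_FILTERED),
}

# Fallback for any combination the table does not list.
DEFAULT_SETTINGS = (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY)


def compress_stream(payload: bytes, kind: str, strategy: str = "SPEED") -> bytes:
    """Compress one stream with settings picked from its kind and the flag."""
    level, zstrategy = ZLIB_SETTINGS.get((kind, strategy), DEFAULT_SETTINGS)
    # wbits=-15 produces a raw deflate stream with no zlib header or trailer.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15, 8, zstrategy)
    return compressor.compress(payload) + compressor.flush()


def decompress_stream(blob: bytes) -> bytes:
    """The reader is identical for every writer-side strategy."""
    return zlib.decompress(blob, -15)
```

Every entry in the table round-trips through the same decompress_stream call, which is why a strategy, unlike a dictionary, does not have to be stored alongside the data.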