For what it is worth, this performance disparity may not be a property of
the encoding but of the Java implementation. At least in arrow-rs,
DELTA_LENGTH_BYTE_ARRAY is ~30% slower than PLAIN when reading data from
memory. Given the non-SIMD-friendly way it encodes the length
information, I would indeed expect it to be slower. Skipping over values
similarly shows no major performance difference, as neither encoding
provides efficient random lookup: both cases require scanning through
either N values or N lengths.
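To make the skipping argument concrete, here is a minimal sketch (illustrative only, not parquet-java's or arrow-rs's actual reader code; all class and method names are hypothetical): under PLAIN you must read each 4-byte length prefix and jump over that many bytes, while with separated lengths you still have to sum the first N lengths to find the data offset. Neither gives O(1) random access.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SkipSketch {
    // PLAIN layout: [len0][bytes0][len1][bytes1]... (little-endian lengths)
    static int skipPlain(ByteBuffer page, int n) {
        for (int i = 0; i < n; i++) {
            int len = page.getInt();               // read the length prefix
            page.position(page.position() + len);  // jump over the value bytes
        }
        return page.position();                    // start of value n
    }

    // Separated lengths (e.g. after DELTA_LENGTH_BYTE_ARRAY's length block
    // has been decoded): still a linear scan, summing the first n lengths.
    static int skipSeparated(int[] lengths, int n) {
        int offset = 0;
        for (int i = 0; i < n; i++) {
            offset += lengths[i];
        }
        return offset;
    }

    public static void main(String[] args) {
        // Three values of lengths 2, 3 and 1 in a PLAIN-style page.
        byte[][] values = {{'a', 'b'}, {'c', 'd', 'e'}, {'f'}};
        ByteBuffer page = ByteBuffer.allocate(18).order(ByteOrder.LITTLE_ENDIAN);
        for (byte[] v : values) {
            page.putInt(v.length);
            page.put(v);
        }
        page.flip();
        System.out.println(skipPlain(page, 2));                   // 13
        System.out.println(skipSeparated(new int[]{2, 3, 1}, 2)); // 5
    }
}
```

Both loops are O(n) in the number of skipped values, which is why neither encoding wins meaningfully on skips.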
Now I am not familiar with the Trino benchmark referred to, and it may
be taking IO into account, which would be affected by overall data size,
but I thought I'd provide another data point.
I'd also add that many modern engines, e.g. DuckDB and Velox, use a
string representation that avoids copying the string data even when the
data is PLAIN encoded, and any Arrow reader supporting the ViewArray
types can perform the same optimisation. Arrow-rs does this; however, in
the benchmark I was running the strings were relatively short (~43
bytes), and so the 30% performance hit of DELTA_LENGTH_BYTE_ARRAY
remained unchanged.
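The zero-copy idea above can be sketched as follows (the names here are illustrative, not Arrow's actual Java API): each value is an (offset, length) pair into a shared data buffer, so decoded string bytes are referenced rather than copied per value.

```java
import java.nio.charset.StandardCharsets;

public class ViewSketch {
    static final class StringView {
        final byte[] data;   // shared buffer, one allocation for the page
        final int offset;
        final int length;

        StringView(byte[] data, int offset, int length) {
            this.data = data;
            this.offset = offset;
            this.length = length;
        }

        // Only materialises a String when a consumer actually needs one.
        String materialise() {
            return new String(data, offset, length, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        // One shared buffer holding "foo" then "barbaz"; views hold no copies.
        byte[] buf = "foobarbaz".getBytes(StandardCharsets.UTF_8);
        StringView a = new StringView(buf, 0, 3);
        StringView b = new StringView(buf, 3, 6);
        System.out.println(a.materialise()); // foo
        System.out.println(b.materialise()); // barbaz
    }
}
```

With a layout like this, the per-value copy cost of PLAIN largely disappears, which is why the remaining difference in my benchmark came down to length decoding rather than data movement.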
Kind Regards,
Raphael Taylor-Davies
On 28/11/2024 17:23, Raunaq Morarka wrote:
The current default for V1 pages is PLAIN encoding. This encoding mixes
string lengths with string data. This is inefficient for skipping N
values, as the encoding does not allow random access. It's also slow to
decode, as the interleaving of lengths with data does not allow
efficient batched implementations and forces most implementations to
copy the data to fit the usual representation of separate offsets and
data for strings.
DELTA_LENGTH_BYTE_ARRAY has none of the above problems, as it separates
offsets and data. The parquet-format spec also recommends it:
https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299
### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)
Supported Types: BYTE_ARRAY
This encoding is always preferred over PLAIN for byte array columns.
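A simplified sketch of the DELTA_LENGTH_BYTE_ARRAY idea (illustrative only: in the real format the lengths are DELTA_BINARY_PACKED, whereas plain ints are used here for brevity, and the class and method names are hypothetical): all lengths come first, followed by all value bytes concatenated, so decoding reduces to one cumulative sum over the lengths with no per-value interleaving of metadata and data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DeltaLengthSketch {
    // Turn the decoded lengths into offsets with a single prefix sum.
    static int[] lengthsToOffsets(int[] lengths) {
        int[] offsets = new int[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            offsets[i + 1] = offsets[i] + lengths[i];
        }
        return offsets;
    }

    // Value i is the slice data[offsets[i]..offsets[i+1]).
    static String value(byte[] data, int[] offsets, int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i],
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Concatenated data section for the values "foo" and "barbaz".
        byte[] data = "foobarbaz".getBytes(StandardCharsets.UTF_8);
        int[] offsets = lengthsToOffsets(new int[]{3, 6});
        System.out.println(Arrays.toString(offsets)); // [0, 3, 9]
        System.out.println(value(data, offsets, 1));  // barbaz
    }
}
```

Because the offsets array matches the usual in-memory representation of separate offsets and data, a reader can hand out the data section as-is instead of copying value by value.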
V2 pages use DELTA_BYTE_ARRAY as the default encoding. This is an
improvement over PLAIN, but it adds complexity which makes it slower to
decode than DELTA_LENGTH_BYTE_ARRAY, with the potential benefit of lower
storage requirements.
JMH benchmarks in Trino's parquet reader at
io.trino.parquet.reader.BenchmarkBinaryColumnReader showed that
DELTA_LENGTH_BYTE_ARRAY can be decoded at over 5X and DELTA_BYTE_ARRAY
at over 2X the speed of decoding PLAIN encoding.
Given the above recommendation of the parquet-format spec and the
significant performance difference, I'm proposing updating parquet-java
to use DELTA_LENGTH_BYTE_ARRAY instead of PLAIN by default for V1 pages.