> While everything supported reading parquet files with the default
> parquet-java settings, the support was significantly less complete.
Agreed, that was one of my motivations to push for this change of default
in parquet-java, as that would be a better guarantee that using
DELTA_LENGTH_BYTE_ARRAY in V1 pages for BINARY would work more widely than
it just being in the parquet spec.

I already found in my testing that such files are readable by Trino, Apache
Hive and parquet-cli, but not by Apache Spark 3.4.2. I filed
https://issues.apache.org/jira/browse/SPARK-50457 for it. This is supposed
to have been supported by Spark since
https://issues.apache.org/jira/browse/SPARK-40128

On Fri, 29 Nov 2024 at 16:21, Andrew Lamb <andrewlam...@gmail.com> wrote:

> > Does anything in the parquet format spec speak against using
> > DELTA_LENGTH_BYTE_ARRAY in V1 pages?
>
> I didn't find anything in the spec to confirm or deny this.
>
> My biggest suggestion when looking to use an encoding other than PLAIN
> is to consider the ecosystem support. When working on InfluxDB 3.0 we
> found that support across various tools that read parquet varied.
>
> While everything supported reading parquet files with the default
> parquet-java settings, the support was significantly less complete. More
> details on this thread [1].
>
> [1]: https://lists.apache.org/thread/tnxbykozo5owq2y36nw7lomr91hrdxhz
>
> On Thu, Nov 28, 2024 at 2:22 PM Raunaq Morarka <raunaqmora...@gmail.com>
> wrote:
>
> > > Skipping over values similarly does not show major performance
> > > differences, as neither encoding actually provides efficient random
> > > lookup, in both cases requiring scanning through either N values or
> > > N lengths.
> >
> > With DELTA_LENGTH_BYTE_ARRAY, you can decode the lengths in a batch
> > and compute the offset to skip to in the strings portion directly from
> > that. See
> > https://github.com/trinodb/trino/blob/ae7d300f024d4f415dfc5f6f63d387418844a172/lib/trino-parquet/src/main/java/io/trino/parquet/reader/decoders/DeltaLengthByteArrayDecoders.java#L135
> > for example.
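[Editor's note: the batched skip described in the quoted message can be
sketched as follows. This is a toy illustration, not the Trino decoder it
links to; the class and method names are hypothetical.]

```java
// Minimal sketch: with DELTA_LENGTH_BYTE_ARRAY, all lengths are available
// up front, so skipping n values reduces to summing n ints to find the
// byte offset in the concatenated strings section. No data bytes are read.
public class DlbaSkip {
    // Byte offset into the strings portion after skipping the first n values.
    static int skipOffset(int[] lengths, int n) {
        int offset = 0;
        for (int i = 0; i < n; i++) {
            offset += lengths[i]; // tight loop over lengths only
        }
        return offset;
    }

    public static void main(String[] args) {
        int[] lengths = {3, 5, 2, 4};
        // Skipping the first 3 values lands at byte 3 + 5 + 2 = 10.
        System.out.println(skipOffset(lengths, 3)); // prints 10
    }
}
```

With PLAIN, by contrast, each length is embedded in the data stream, so the
same skip must touch every intervening value.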
> > With PLAIN we're going back and forth between decoding an int and
> > using that to skip to the next encoded length position.
> >
> > > Given the non-SIMD friendly way it encodes the length information I
> > > would indeed expect it to be slower.
> >
> > While the encoding of the length values is not straightforward, it's
> > still possible to write batched routines for decoding bit-packed
> > integers followed by a prefix sum. E.g.
> > https://github.com/trinodb/trino/blob/ae7d300f024d4f415dfc5f6f63d387418844a172/lib/trino-parquet/src/main/java/io/trino/parquet/reader/decoders/DeltaPackingUtils.java#L52
> >
> > > Now I am not familiar with the Trino benchmark referred to, and it
> > > may be taking IO into account which would be impacted by overall
> > > data size
> >
> > The benchmark at
> > https://github.com/trinodb/trino/blob/7c6157343583bdad8933a899f7e8fd3ecd4de1a9/lib/trino-parquet/src/test/java/io/trino/parquet/reader/BenchmarkBinaryColumnReader.java#L44
> > is not doing any IO. It generates the data for testing in-memory and
> > runs the decoder on it. It does not use any compression either.
> >
> > I can see that the performance may come down to implementation
> > details, especially whether there is the ability to avoid copies of
> > the string data for PLAIN encoding. Maybe it makes sense for Trino to
> > change the default for its own writer.
> >
> > Does anything in the parquet format spec speak against using
> > DELTA_LENGTH_BYTE_ARRAY in V1 pages? I didn't find anything in the
> > spec to confirm or deny this.
> >
> > On Thu, 28 Nov 2024 at 23:56, Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.invalid> wrote:
> >
> > > For what it is worth, this performance disparity may not be a
> > > property of the encoding but instead of the Java implementation. At
> > > least in arrow-rs, DELTA_LENGTH_BYTE_ARRAY is ~30% slower than PLAIN
> > > when reading data from memory.
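[Editor's note: the "bit-packed integers followed by a prefix sum" idea in
the quoted message can be sketched like this. The lengths in
DELTA_LENGTH_BYTE_ARRAY are DELTA_BINARY_PACKED, i.e. stored as deltas
from the previous value; this toy model assumes the deltas are already
bit-unpacked and only shows the prefix-sum step, not the parquet-java or
Trino implementation.]

```java
// Recovering absolute lengths from a first value plus unpacked deltas:
// a single branch-free prefix-sum loop, which is what makes batched
// decoding of the length stream cheap.
public class DeltaPrefixSum {
    // deltas[i] = lengths[i + 1] - lengths[i]; lengths[0] = firstValue
    static int[] decodeLengths(int firstValue, int[] deltas) {
        int[] lengths = new int[deltas.length + 1];
        lengths[0] = firstValue;
        for (int i = 0; i < deltas.length; i++) {
            lengths[i + 1] = lengths[i] + deltas[i]; // running prefix sum
        }
        return lengths;
    }

    public static void main(String[] args) {
        // lengths 7, 4, 9, 9 encoded as first value 7 and deltas -3, +5, 0
        int[] lengths = decodeLengths(7, new int[]{-3, 5, 0});
        System.out.println(java.util.Arrays.toString(lengths)); // [7, 4, 9, 9]
    }
}
```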
> > > Given the non-SIMD friendly way it encodes the length information I
> > > would indeed expect it to be slower. Skipping over values similarly
> > > does not show major performance differences, as neither encoding
> > > actually provides efficient random lookup, in both cases requiring
> > > scanning through either N values or N lengths.
> > >
> > > Now I am not familiar with the Trino benchmark referred to, and it
> > > may be taking IO into account, which would be impacted by overall
> > > data size, but I thought I'd provide another data point.
> > >
> > > I'd also add that many modern engines, e.g. DuckDB and Velox, use a
> > > string encoding that avoids needing to copy the string data even
> > > when the data is PLAIN encoded, and all Arrow readers supporting the
> > > ViewArray types can perform the same optimisation. Arrow-rs does
> > > this; however, in the benchmark I was running the strings were
> > > relatively short (~43 bytes) and so the 30% performance hit of
> > > DELTA_LENGTH_BYTE_ARRAY remained unchanged.
> > >
> > > Kind Regards,
> > >
> > > Raphael Taylor-Davies
> > >
> > > On 28/11/2024 17:23, Raunaq Morarka wrote:
> > > > The current default for V1 pages is PLAIN encoding. This encoding
> > > > mixes string lengths with string data. This is inefficient for
> > > > skipping N values, as the encoding does not allow random access.
> > > > It's also slow to decode, as the interleaving of lengths with data
> > > > does not allow efficient batched implementations and forces most
> > > > implementations to make copies of the data to fit the usual
> > > > representation of separate offsets and data for strings.
> > > >
> > > > DELTA_LENGTH_BYTE_ARRAY has none of the above problems, as it
> > > > separates offsets and data.
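[Editor's note: the copy-avoiding "view" representation mentioned in the
quoted message (as used by DuckDB, Velox, and Arrow's ViewArray types) can
be sketched as below. The names are hypothetical; the point is that values
stay as (offset, length) references into the shared page buffer and are
only materialized on demand.]

```java
// Views into a shared page buffer: reading PLAIN data this way records
// where each string lives instead of copying its bytes out of the page.
public class StringViews {
    record View(int offset, int length) {}

    // Copy happens only here, when a value is actually consumed.
    static String materialize(byte[] pageData, View v) {
        return new String(pageData, v.offset(), v.length(),
                java.nio.charset.StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = "foobarbaz".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        View bar = new View(3, 3); // points into the page; no copy yet
        System.out.println(materialize(page, bar)); // prints bar
    }
}
```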
> > > > The parquet-format spec also seems to recommend this:
> > > > https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299
> > > >
> > > > ### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)
> > > >
> > > > Supported Types: BYTE_ARRAY
> > > >
> > > > This encoding is always preferred over PLAIN for byte array
> > > > columns.
> > > >
> > > > V2 pages use DELTA_BYTE_ARRAY as the default encoding. This is an
> > > > improvement over PLAIN, but adds complexity which makes it slower
> > > > to decode than DELTA_LENGTH_BYTE_ARRAY, with the potential benefit
> > > > of lower storage requirements.
> > > >
> > > > JMH benchmarks of Trino's parquet reader at
> > > > io.trino.parquet.reader.BenchmarkBinaryColumnReader showed that
> > > > DELTA_LENGTH_BYTE_ARRAY can be decoded at over 5X and
> > > > DELTA_BYTE_ARRAY at over 2X the speed of decoding PLAIN encoding.
> > > >
> > > > Given the above recommendation of the parquet-format spec and the
> > > > significant performance difference, I'm proposing updating
> > > > parquet-java to use DELTA_LENGTH_BYTE_ARRAY instead of PLAIN by
> > > > default for V1 pages.
> >
> > --
> > Regards,
> > Raunaq Morarka,
> > Email: raunaqmora...@gmail.com

--
Regards,
Raunaq Morarka,
Email: raunaqmora...@gmail.com
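[Editor's note: the layout difference the proposal hinges on can be made
concrete with a toy encoder. This is not a spec-conformant Parquet writer:
PLAIN does use a little-endian 4-byte length before each value, but a real
DELTA_LENGTH_BYTE_ARRAY page delta-packs its length block, which is kept
as plain ints here for brevity.]

```java
// PLAIN interleaves [len][data][len][data]...; DELTA_LENGTH_BYTE_ARRAY
// stores all lengths first, then the concatenated string bytes, which is
// what enables batched length decoding and cheap skips.
public class PageLayouts {
    static byte[] lenLE(int n) { // little-endian 4-byte int, as PLAIN uses
        return java.nio.ByteBuffer.allocate(4)
                .order(java.nio.ByteOrder.LITTLE_ENDIAN).putInt(n).array();
    }

    static byte[] plain(String[] values) {
        var out = new java.io.ByteArrayOutputStream();
        for (String v : values) {
            byte[] b = v.getBytes(java.nio.charset.StandardCharsets.UTF_8);
            out.writeBytes(lenLE(b.length)); // length interleaved with data
            out.writeBytes(b);
        }
        return out.toByteArray();
    }

    static byte[] deltaLength(String[] values) {
        var out = new java.io.ByteArrayOutputStream();
        for (String v : values) { // all lengths up front (simplified)
            out.writeBytes(lenLE(v.getBytes(java.nio.charset.StandardCharsets.UTF_8).length));
        }
        for (String v : values) { // then the concatenated string data
            out.writeBytes(v.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String[] values = {"foo", "quux"};
        System.out.println(plain(values).length);       // 4+3 + 4+4 = 15
        System.out.println(deltaLength(values).length); // 4+4 + 3+4 = 15
    }
}
```

Same bytes, different order: the second layout groups the lengths so they
can be decoded (and summed) without touching the string data.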