> While everything supported reading parquet files with the default
> parquet-java settings, the support was significantly less complete.
Agreed, that was one of my motivations to push for this change of default
in parquet-java, as that would be a better guarantee that using
DELTA_LENGTH_BYTE_ARRAY in V1 pages for BINARY would work more widely than
it just being in the parquet spec.

I already found in my testing that such files are readable by Trino, Apache
Hive and parquet-cli, but not by Apache Spark 3.4.2. I filed
https://issues.apache.org/jira/browse/SPARK-50457 for it. This is supposed
to have been supported by Spark since
https://issues.apache.org/jira/browse/SPARK-40128

On Fri, 29 Nov 2024 at 16:21, Andrew Lamb <andrewlam...@gmail.com> wrote:

> > Does anything in the parquet format spec speak against using
> > DELTA_LENGTH_BYTE_ARRAY in V1 pages?
>
> I didn't find anything in the spec to confirm or deny this.
>
> My biggest suggestion when looking to use an encoding other than PLAIN
> is to consider the ecosystem support. When working on InfluxDB 3.0 we
> found that support across various tools that read parquet varied.
>
> While everything supported reading parquet files with the default
> parquet-java settings, the support was significantly less complete. More
> details on this thread [1].
>
> [1]: https://lists.apache.org/thread/tnxbykozo5owq2y36nw7lomr91hrdxhz
>
> On Thu, Nov 28, 2024 at 2:22 PM Raunaq Morarka <raunaqmora...@gmail.com>
> wrote:
>
> > > Skipping over values similarly does not show major performance
> > > differences, as neither encoding actually provides efficient random
> > > lookup, in both cases requiring scanning through either N values or
> > > N lengths.
> >
> > With DELTA_LENGTH_BYTE_ARRAY, you can decode the lengths in a batch
> > and compute the offset to skip to in the strings portion directly from
> > that. See
> > https://github.com/trinodb/trino/blob/ae7d300f024d4f415dfc5f6f63d387418844a172/lib/trino-parquet/src/main/java/io/trino/parquet/reader/decoders/DeltaLengthByteArrayDecoders.java#L135
> > for example.
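[Editor's note: the batched skip described in the quoted message can be
sketched as follows. This is a toy illustration, not the Trino decoder it
links to; the class and method names are hypothetical.]

```java
// Minimal sketch: with DELTA_LENGTH_BYTE_ARRAY, all lengths are available
// up front, so skipping n values reduces to summing n ints to find the
// byte offset in the concatenated strings section. No data bytes are read.
public class DlbaSkip {
    // Byte offset into the strings portion after skipping the first n values.
    static int skipOffset(int[] lengths, int n) {
        int offset = 0;
        for (int i = 0; i < n; i++) {
            offset += lengths[i]; // tight loop over lengths only
        }
        return offset;
    }

    public static void main(String[] args) {
        int[] lengths = {3, 5, 2, 4};
        // Skipping the first 3 values lands at byte 3 + 5 + 2 = 10.
        System.out.println(skipOffset(lengths, 3)); // prints 10
    }
}
```

With PLAIN, by contrast, each length is embedded in the data stream, so the
same skip must touch every intervening value.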
> > With PLAIN we're going back and forth between decoding an int and
> > using that to skip to the next encoded length position.
> >
> > > Given the non-SIMD friendly way it encodes the length information I
> > > would indeed expect it to be slower.
> >
> > While the encoding of the length values is not straightforward, it's
> > still possible to write batched routines for decoding bit-packed
> > integers followed by a prefix sum. E.g.
> > https://github.com/trinodb/trino/blob/ae7d300f024d4f415dfc5f6f63d387418844a172/lib/trino-parquet/src/main/java/io/trino/parquet/reader/decoders/DeltaPackingUtils.java#L52
> >
> > > Now I am not familiar with the Trino benchmark referred to, and it
> > > may be taking IO into account which would be impacted by overall
> > > data size
> >
> > The benchmark at
> > https://github.com/trinodb/trino/blob/7c6157343583bdad8933a899f7e8fd3ecd4de1a9/lib/trino-parquet/src/test/java/io/trino/parquet/reader/BenchmarkBinaryColumnReader.java#L44
> > is not doing any IO. It generates the data for testing in-memory and
> > runs the decoder on it. It does not use any compression either.
> >
> > I can see that the performance may come down to implementation
> > details, especially whether there is the ability to avoid copies of
> > the string data for PLAIN encoding. Maybe it makes sense for Trino to
> > change the default for its own writer.
> >
> > Does anything in the parquet format spec speak against using
> > DELTA_LENGTH_BYTE_ARRAY in V1 pages? I didn't find anything in the
> > spec to confirm or deny this.
> >
> > On Thu, 28 Nov 2024 at 23:56, Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.invalid> wrote:
> >
> > > For what it is worth, this performance disparity may not be a
> > > property of the encoding but instead of the Java implementation. At
> > > least in arrow-rs, DELTA_LENGTH_BYTE_ARRAY is ~30% slower than PLAIN
> > > when reading data from memory.
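[Editor's note: the "bit-packed integers followed by a prefix sum" idea in
the quoted message can be sketched like this. The lengths in
DELTA_LENGTH_BYTE_ARRAY are DELTA_BINARY_PACKED, i.e. stored as deltas
from the previous value; this toy model assumes the deltas are already
bit-unpacked and only shows the prefix-sum step, not the parquet-java or
Trino implementation.]

```java
// Recovering absolute lengths from a first value plus unpacked deltas:
// a single branch-free prefix-sum loop, which is what makes batched
// decoding of the length stream cheap.
public class DeltaPrefixSum {
    // deltas[i] = lengths[i + 1] - lengths[i]; lengths[0] = firstValue
    static int[] decodeLengths(int firstValue, int[] deltas) {
        int[] lengths = new int[deltas.length + 1];
        lengths[0] = firstValue;
        for (int i = 0; i < deltas.length; i++) {
            lengths[i + 1] = lengths[i] + deltas[i]; // running prefix sum
        }
        return lengths;
    }

    public static void main(String[] args) {
        // lengths 7, 4, 9, 9 encoded as first value 7 and deltas -3, +5, 0
        int[] lengths = decodeLengths(7, new int[]{-3, 5, 0});
        System.out.println(java.util.Arrays.toString(lengths)); // [7, 4, 9, 9]
    }
}
```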
> > > Given the non-SIMD friendly way it encodes the length information I
> > > would indeed expect it to be slower. Skipping over values similarly
> > > does not show major performance differences, as neither encoding
> > > actually provides efficient random lookup, in both cases requiring
> > > scanning through either N values or N lengths.
> > >
> > > Now I am not familiar with the Trino benchmark referred to, and it
> > > may be taking IO into account, which would be impacted by overall
> > > data size, but I thought I'd provide another data point.
> > >
> > > I'd also add that many modern engines, e.g. DuckDB and Velox, use a
> > > string encoding that avoids needing to copy the string data even
> > > when the data is PLAIN encoded, and all Arrow readers supporting the
> > > ViewArray types can perform the same optimisation. Arrow-rs does
> > > this; however, in the benchmark I was running the strings were
> > > relatively short (~43 bytes) and so the 30% performance hit of
> > > DELTA_LENGTH_BYTE_ARRAY remained unchanged.
> > >
> > > Kind Regards,
> > >
> > > Raphael Taylor-Davies
> > >
> > > On 28/11/2024 17:23, Raunaq Morarka wrote:
> > > > The current default for V1 pages is PLAIN encoding. This encoding
> > > > mixes string lengths with string data. This is inefficient for
> > > > skipping N values, as the encoding does not allow random access.
> > > > It's also slow to decode, as the interleaving of lengths with data
> > > > does not allow efficient batched implementations and forces most
> > > > implementations to make copies of the data to fit the usual
> > > > representation of separate offsets and data for strings.
> > > >
> > > > DELTA_LENGTH_BYTE_ARRAY has none of the above problems, as it
> > > > separates offsets and data.
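[Editor's note: the copy-avoiding "view" representation mentioned in the
quoted message (as used by DuckDB, Velox, and Arrow's ViewArray types) can
be sketched as below. The names are hypothetical; the point is that values
stay as (offset, length) references into the shared page buffer and are
only materialized on demand.]

```java
// Views into a shared page buffer: reading PLAIN data this way records
// where each string lives instead of copying its bytes out of the page.
public class StringViews {
    record View(int offset, int length) {}

    // Copy happens only here, when a value is actually consumed.
    static String materialize(byte[] pageData, View v) {
        return new String(pageData, v.offset(), v.length(),
                java.nio.charset.StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = "foobarbaz".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        View bar = new View(3, 3); // points into the page; no copy yet
        System.out.println(materialize(page, bar)); // prints bar
    }
}
```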
> > > > The parquet-format spec also seems to recommend this:
> > > > https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299
> > > >
> > > > ### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)
> > > >
> > > > Supported Types: BYTE_ARRAY
> > > >
> > > > This encoding is always preferred over PLAIN for byte array
> > > > columns.
> > > >
> > > > V2 pages use DELTA_BYTE_ARRAY as the default encoding. This is an
> > > > improvement over PLAIN, but adds complexity which makes it slower
> > > > to decode than DELTA_LENGTH_BYTE_ARRAY, with the potential benefit
> > > > of lower storage requirements.
> > > >
> > > > JMH benchmarks of Trino's parquet reader at
> > > > io.trino.parquet.reader.BenchmarkBinaryColumnReader showed that
> > > > DELTA_LENGTH_BYTE_ARRAY can be decoded at over 5X and
> > > > DELTA_BYTE_ARRAY at over 2X the speed of decoding PLAIN encoding.
> > > >
> > > > Given the above recommendation of the parquet-format spec and the
> > > > significant performance difference, I'm proposing updating
> > > > parquet-java to use DELTA_LENGTH_BYTE_ARRAY instead of PLAIN by
> > > > default for V1 pages.
> >
> > --
> > Regards,
> > Raunaq Morarka,
> > Email: raunaqmora...@gmail.com

--
Regards,
Raunaq Morarka,
Email: raunaqmora...@gmail.com
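[Editor's note: the layout difference the proposal hinges on can be made
concrete with a toy encoder. This is not a spec-conformant Parquet writer:
PLAIN does use a little-endian 4-byte length before each value, but a real
DELTA_LENGTH_BYTE_ARRAY page delta-packs its length block, which is kept
as plain ints here for brevity.]

```java
// PLAIN interleaves [len][data][len][data]...; DELTA_LENGTH_BYTE_ARRAY
// stores all lengths first, then the concatenated string bytes, which is
// what enables batched length decoding and cheap skips.
public class PageLayouts {
    static byte[] lenLE(int n) { // little-endian 4-byte int, as PLAIN uses
        return java.nio.ByteBuffer.allocate(4)
                .order(java.nio.ByteOrder.LITTLE_ENDIAN).putInt(n).array();
    }

    static byte[] plain(String[] values) {
        var out = new java.io.ByteArrayOutputStream();
        for (String v : values) {
            byte[] b = v.getBytes(java.nio.charset.StandardCharsets.UTF_8);
            out.writeBytes(lenLE(b.length)); // length interleaved with data
            out.writeBytes(b);
        }
        return out.toByteArray();
    }

    static byte[] deltaLength(String[] values) {
        var out = new java.io.ByteArrayOutputStream();
        for (String v : values) { // all lengths up front (simplified)
            out.writeBytes(lenLE(v.getBytes(java.nio.charset.StandardCharsets.UTF_8).length));
        }
        for (String v : values) { // then the concatenated string data
            out.writeBytes(v.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String[] values = {"foo", "quux"};
        System.out.println(plain(values).length);       // 4+3 + 4+4 = 15
        System.out.println(deltaLength(values).length); // 4+4 + 3+4 = 15
    }
}
```

Same bytes, different order: the second layout groups the lengths so they
can be decoded (and summed) without touching the string data.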