etseidl commented on PR #8258: URL: https://github.com/apache/arrow-rs/pull/8258#issuecomment-3398657635
> I have no understanding about the implications of this change (e.g. impact on output size, compatibility, etc)
>
> Could someone explain that?

When a dictionary for a given column chunk grows too large, the parquet encoders fall back to another encoding for subsequent pages. For V1 pages, the fallback is the plain encoder. When the delta encodings were added, the entry for `DELTA_LENGTH_BYTE_ARRAY` stated that this encoding is always preferred over `PLAIN` for byte array columns ([link](https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6)). Many writers then changed their fallback behavior when writing V2 pages, but rather than the recommended `DELTA_LENGTH_BYTE_ARRAY` they all seemed to favor `DELTA_BYTE_ARRAY`. The issue with `DELTA_BYTE_ARRAY` (which uses front compression, i.e. shared prefixes between consecutive values) is that unsorted data yields very short to non-existent prefixes, so it ends up being just a slower `DELTA_LENGTH_BYTE_ARRAY`. @mapleFU is suggesting we change this behavior and fall back to the faster `DELTA_LENGTH_BYTE_ARRAY` instead. Users with sorted data can still override the encoding per column, as sketched below.
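For reference, a per-column override might look something like this. This is a minimal sketch assuming the `parquet` crate's `WriterProperties` builder; the column name `sorted_string_col` is hypothetical.

```rust
use parquet::basic::Encoding;
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn main() {
    // Keep DELTA_BYTE_ARRAY for a column known to contain sorted data,
    // where the shared-prefix (front compression) savings actually
    // materialize. `sorted_string_col` is a hypothetical column name.
    let props = WriterProperties::builder()
        .set_column_encoding(
            ColumnPath::from("sorted_string_col"),
            Encoding::DELTA_BYTE_ARRAY,
        )
        .build();

    // `props` is then handed to the writer, e.g.
    // `ArrowWriter::try_new(file, schema, Some(props))`.
    let _ = props;
}
```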
> A ticket would be an ideal location

I'll try to get an issue started on this. I'm going to mark this as draft in the meantime.