etseidl commented on PR #8258:
URL: https://github.com/apache/arrow-rs/pull/8258#issuecomment-3398657635

   > I have no understanding about the implications of this change (e.g impact 
on output size, compatibility, etc)
   > 
   > Could someone explain that?
   
   When a dictionary for a given column chunk grows too large, the parquet 
encoders fall back to some other encoding for subsequent pages. For V1 pages, 
the fallback is the plain encoder. When the delta encodings were added, the 
entry for `DELTA_LENGTH_BYTE_ARRAY` stated that this encoding is always 
preferred over `PLAIN` for byte array columns 
([link](https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6)).
 Many writers then changed their fallback behavior when writing V2 pages, but 
rather than the recommended `DELTA_LENGTH_BYTE_ARRAY` they all seemed to favor 
`DELTA_BYTE_ARRAY`. The problem with `DELTA_BYTE_ARRAY` (which uses prefix 
compression, also known as front coding) is that unsorted data produces very 
short to non-existent shared prefixes, so the encoding degenerates into a 
slower `DELTA_LENGTH_BYTE_ARRAY`.
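   
   To make the prefix point concrete, here is a minimal Rust sketch (not from 
this PR; the data and function name are illustrative) of the per-value 
shared-prefix length that `DELTA_BYTE_ARRAY` records:
   
   ```rust
   /// Length of the byte prefix shared by two adjacent values, which is
   /// the quantity DELTA_BYTE_ARRAY stores (and strips) for each value.
   fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
       a.iter().zip(b).take_while(|(x, y)| x == y).count()
   }
   
   fn main() {
       // Sorted data: adjacent values share long prefixes, so front coding helps.
       let sorted = ["user_0001", "user_0002", "user_0010"];
       // Unsorted data: adjacent values share little or nothing, so the encoder
       // pays the prefix bookkeeping cost for no space savings.
       let unsorted = ["zebra", "apple", "mango"];
   
       for pair in sorted.windows(2) {
           println!("sorted:   {}", common_prefix_len(pair[0].as_bytes(), pair[1].as_bytes()));
       }
       for pair in unsorted.windows(2) {
           println!("unsorted: {}", common_prefix_len(pair[0].as_bytes(), pair[1].as_bytes()));
       }
   }
   ```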
   
   @mapleFU is suggesting we change this behavior and fall back to the faster 
`DELTA_LENGTH_BYTE_ARRAY` encoder instead. Users who know a column is sorted 
can still override the encoding for that column.
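   
   For example, assuming the current `WriterProperties` builder API in this 
crate, a per-column override might look like the following sketch (the column 
name is hypothetical):
   
   ```rust
   use parquet::basic::Encoding;
   use parquet::file::properties::WriterProperties;
   use parquet::schema::types::ColumnPath;
   
   // Force DELTA_BYTE_ARRAY for a column known to be sorted, so prefix
   // compression actually pays off; other columns keep the default fallback.
   let sorted_col = ColumnPath::from("my_sorted_string_col"); // hypothetical name
   let props = WriterProperties::builder()
       .set_column_encoding(sorted_col.clone(), Encoding::DELTA_BYTE_ARRAY)
       // Optionally skip dictionary encoding entirely for this column.
       .set_column_dictionary_enabled(sorted_col, false)
       .build();
   ```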
   
   > 
   > A ticket would be an ideal location
   
   I'll try to get an issue started on this. I'm going to mark this as draft in 
the meantime.
   
   

