vigneshsiva11 commented on issue #9247:
URL: https://github.com/apache/arrow-rs/issues/9247#issuecomment-4018960578

   Hi @alamb, I’d like to work on this issue.
   
   From looking into the failure, it seems the problem happens during the 
Parquet roundtrip when rebuilding a `DictionaryArray` whose values are a 
`FixedSizeBinaryArray`. The error suggests that the resulting 
`FixedSizeBinaryArray` ends up with two buffers, even though `FixedSizeBinary` 
should only have a single data buffer.
   
   My current guess is that somewhere in the Parquet → Arrow decoding path, 
`FixedSizeBinary` is being handled like the variable-length binary types 
(`Binary` or `LargeBinary`), which carry an offsets buffer in addition to the 
data buffer. Since every `FixedSizeBinary` value has the same width, value 
boundaries follow from the width alone, so the array should only use the data 
buffer.
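   
   To make the layout difference concrete, here is a minimal sketch using only 
the Rust standard library. `VarBinary` and `FixedBinary` are hypothetical 
stand-ins that I made up for illustration; they are not arrow-rs types and do 
not reflect how arrow-rs builds `ArrayData`. They only show why a 
variable-length layout needs two buffers while a fixed-width layout needs one.

```rust
// Hypothetical, simplified models of the two layouts (NOT arrow-rs types).

/// Variable-length binary: an offsets buffer plus a data buffer.
struct VarBinary {
    offsets: Vec<i32>, // one more entry than the number of values
    data: Vec<u8>,
}

/// Fixed-size binary: a single data buffer; the width is part of the type.
struct FixedBinary {
    width: usize,
    data: Vec<u8>,
}

impl VarBinary {
    /// Value `i` is delimited by two consecutive offsets.
    fn value(&self, i: usize) -> &[u8] {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        &self.data[start..end]
    }

    /// Number of buffers this layout needs (offsets + data).
    fn buffer_count(&self) -> usize {
        2
    }
}

impl FixedBinary {
    /// Value `i` is located by multiplying the index by the fixed width.
    fn value(&self, i: usize) -> &[u8] {
        &self.data[i * self.width..(i + 1) * self.width]
    }

    /// Number of buffers this layout needs (data only).
    fn buffer_count(&self) -> usize {
        1
    }
}

fn main() {
    let var = VarBinary { offsets: vec![0, 2, 4], data: b"abcd".to_vec() };
    let fixed = FixedBinary { width: 2, data: b"abcd".to_vec() };

    // Both layouts expose the same logical values...
    assert_eq!(var.value(1), &b"cd"[..]);
    assert_eq!(fixed.value(1), &b"cd"[..]);

    // ...but only the variable-length one carries an offsets buffer.
    assert_eq!(var.buffer_count(), 2);
    assert_eq!(fixed.buffer_count(), 1);
}
```

   If the decoding path builds the dictionary values with the first layout but 
labels them `FixedSizeBinary`, that would be consistent with the two-buffer 
error, though I still need to confirm this in the actual code.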
   
   My plan is to:
   
   1. Trace how the array is reconstructed during the Parquet roundtrip when 
dictionary encoding is used.
   2. Locate where the `ArrayData` for `FixedSizeBinary` is created and check 
how the buffers are assigned.
   3. Make sure `FixedSizeBinary` arrays are built with only the data buffer 
(no offset buffer).
   4. Add a regression test using a `DictionaryArray` with `FixedSizeBinary` 
values to confirm the roundtrip works correctly.
   
   Does this approach sound reasonable, or is there a specific part of the 
codebase you’d recommend focusing on? Happy to adjust based on feedback.
   

