k8ika0s commented on PR #48205:
URL: https://github.com/apache/arrow/pull/48205#issuecomment-3568405152

   @Vishwanatha-HD
   
   Working through this one, I’m reminded how many odd little corners show up 
when Arrow’s layout meets Parquet’s expectations — especially around levels, 
decimals, and the legacy INT96 bits.
   
   Looking at the pieces that overlap with the work I’ve been doing, the 
overall direction makes sense. A few notes from what I’ve seen on real s390x 
hardware:
   
   **• BIT_PACKED level headers**
   Your patch keeps the `data_size - 4` bound under `ARROW_LITTLE_ENDIAN`, 
whereas my tree accepts the full BIT_PACKED buffer and logs decode failures 
rather than bounding early. Neither approach is wrong, but on BE 
machines I’ve found that the “minus 4” guard sometimes rejects buffers that are 
actually fine, depending on how many values the upstream encoder produced.
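   To make the difference concrete, here's a minimal sketch (function names are mine, not the PR's): for BIT_PACKED levels the exact payload is `ceil(num_values * bit_width / 8)` bytes, so a guard that subtracts 4 — as if an RLE-style length prefix were present — can reject a buffer whose payload legitimately occupies the last four bytes.

```cpp
#include <cstdint>

// Exact byte count of a BIT_PACKED run: round num_values * bit_width
// up to whole bytes.
int64_t BitPackedByteCount(int64_t num_values, int bit_width) {
  return (num_values * bit_width + 7) / 8;
}

// The "minus 4" guard: bounds the payload as if a 4-byte length prefix
// were always present.
bool FitsWithMinusFourGuard(int64_t num_values, int bit_width,
                            int64_t data_size) {
  return BitPackedByteCount(num_values, bit_width) <= data_size - 4;
}

// The permissive alternative: accept anything that fits the full buffer.
bool FitsFullBuffer(int64_t num_values, int bit_width, int64_t data_size) {
  return BitPackedByteCount(num_values, bit_width) <= data_size;
}
```

   With 32 one-bit levels in a 4-byte buffer, the payload fits exactly, yet the minus-4 guard rejects it — which is the false-negative pattern I was describing.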
   
   **• Decimal serialization**
   This is one of the trickier spots. Parquet expects decimals in a big-endian 
128-bit payload, but Arrow materializes them in little-endian limbs even on BE 
hardware. In my implementation I reverse the Arrow words (`[low, high]` → 
`[high, low]`) before handing them to the writer so the final byte stream 
matches the canonical Parquet format.
   Your patch uses `ToBigEndian` on each limb directly in host order, which 
works for many cases but can produce a differently ordered representation when 
Arrow’s in-memory layout doesn’t match the 128-bit big-endian wire format. Just 
sharing that in case you’ve seen similar behavior when mixing different decimal 
widths.
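   For reference, the limb reversal I described can be written in a host-endianness-independent way by assembling the 16-byte payload from shifted values rather than from raw memory (a sketch with hypothetical names, not the code in either tree):

```cpp
#include <array>
#include <cstdint>

// Build the 16-byte big-endian payload Parquet expects from Arrow's
// [low, high] 64-bit limbs. Shifting on values (not memcpy on memory)
// makes this correct on both LE and BE hosts.
std::array<uint8_t, 16> DecimalToParquetBE(uint64_t low, uint64_t high) {
  std::array<uint8_t, 16> out{};
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(high >> (56 - 8 * i));     // high limb first
    out[8 + i] = static_cast<uint8_t>(low >> (56 - 8 * i));  // then low limb
  }
  return out;
}
```

   The point is that the limb *order* has to flip (`[low, high]` → `[high, low]`) in addition to the per-limb byte order; doing only one of the two gives a payload that round-trips on one architecture but not the other.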
   
   **• Half-floats in FLBA**
   The BE path you added with `ToLittleEndian(values[i])` aligns with the 
intent. I ended up staging the FLBA structs and the 2-byte payloads together in 
one scratch buffer, mostly because some downstream consumers treat the pointer 
lifetime very strictly. Either way, normalizing those 2-byte halves before page 
assembly helps avoid the cross-architecture drift I’ve run into.
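   The normalization itself is tiny; what matters is doing it by value so it behaves identically on LE and BE hosts (sketch, hypothetical name):

```cpp
#include <cstdint>

// Serialize a raw half-float bit pattern as the two little-endian bytes
// the page payload expects, regardless of host byte order.
inline void HalfToLEBytes(uint16_t half_bits, uint8_t* out) {
  out[0] = static_cast<uint8_t>(half_bits & 0xFF);         // low byte first
  out[1] = static_cast<uint8_t>((half_bits >> 8) & 0xFF);  // then high byte
}
```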
   
   **• Paging / DoInBatches**
   Your rewrite to enforce `max_rows_per_page` is a meaningful cleanup. My 
patches didn’t touch this area, so no conflicts there — but just to mention it, 
keeping the paging logic predictable on BE made debugging the level stream 
quite a bit easier for me.
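   The batching pattern I mean is essentially this (a simplified sketch — the real `DoInBatches` flushes pages via a callback rather than returning sizes):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Split total_rows into chunks of at most max_rows_per_page. Returned
// sizes stand in for the per-page writes a real writer would perform.
std::vector<int64_t> DoInBatches(int64_t total_rows, int64_t max_rows_per_page) {
  std::vector<int64_t> batch_sizes;
  for (int64_t offset = 0; offset < total_rows;) {
    int64_t n = std::min(max_rows_per_page, total_rows - offset);
    batch_sizes.push_back(n);  // real code: flush a page of n rows here
    offset += n;
  }
  return batch_sizes;
}
```

   Deterministic batch boundaries like this made it much easier to line up the level stream with page boundaries when diffing LE and BE output.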
   
   **• INT96 (Impala timestamp)**
   Your implementation writes host-order limbs on BE and does a plain memcpy on LE. In my 
case I leaned heavily on always emitting LE limbs so the decode path doesn’t 
have to branch on architecture. Both approaches work as long as the 
corresponding reader expects the same convention.
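   For clarity, my "always LE" convention looks roughly like this (hypothetical helper, not the PR's code): 8 bytes of nanos-of-day followed by a 4-byte Julian day, both little-endian, built by value so no architecture branch is needed.

```cpp
#include <cstdint>

// Emit an INT96 (Impala-style) timestamp as 12 little-endian bytes:
// 8-byte nanoseconds-of-day, then 4-byte Julian day. Shifting on values
// gives identical output on LE and BE hosts.
void Int96ToLEBytes(uint64_t nanos_of_day, uint32_t julian_day,
                    uint8_t out[12]) {
  for (int i = 0; i < 8; ++i)
    out[i] = static_cast<uint8_t>(nanos_of_day >> (8 * i));
  for (int i = 0; i < 4; ++i)
    out[8 + i] = static_cast<uint8_t>(julian_day >> (8 * i));
}
```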
   
   None of this is blocking — just trying to pass along the details I’ve seen 
crop up when running the full parquet-encode → parquet-decode cycle on 
big-endian hardware. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
