vegarsti opened a new pull request, #8589:
URL: https://github.com/apache/arrow-rs/pull/8589

   # Which issue does this PR close?
   
   - Contributes towards the RunEndEncoded (REE) epic #3520; there is no 
specific issue for casting.
   
   # Rationale for this change
   
   This PR implements casting support for RunEndEncoded arrays in Apache Arrow. 
RunEndEncoded is a compression format that stores consecutive runs of equal 
values efficiently, but previously lacked casting functionality. This change 
enables:
   
   1. Casting FROM RunEndEncoded arrays - Converting the values within a 
RunEndEncoded array to different data types while preserving the run structure
   2. Casting TO RunEndEncoded arrays - Converting regular arrays into 
RunEndEncoded format by performing run-end encoding
   3. Full integration with Arrow's casting system - Making RunEndEncoded 
arrays work with the existing cast() and can_cast_types() functions
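
To make the two directions concrete, here is a minimal sketch in plain Rust, using only the standard library. It is not the code in this PR: `run_end_encode` and `cast_values_to_string` are hypothetical helper names, nulls are ignored, and `Vec`s stand in for Arrow arrays. It only illustrates that encoding collapses equal neighbours into runs, and that casting an existing REE array touches the values while the run ends stay as they are.

```rust
use std::convert::TryFrom;

/// Run-end encode a slice into (run_ends, values).
/// `run_ends[i]` is the exclusive end index of run `i` in the input.
fn run_end_encode<T: PartialEq + Clone>(input: &[T]) -> (Vec<i32>, Vec<T>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    for (i, item) in input.iter().enumerate() {
        // A new run starts whenever the value differs from the current run's value.
        if values.last() != Some(item) {
            if i > 0 {
                // Close the previous run; fail loudly rather than wrap on overflow.
                run_ends.push(i32::try_from(i).expect("run end exceeds i32"));
            }
            values.push(item.clone());
        }
    }
    if !input.is_empty() {
        // Close the final run at the input length.
        run_ends.push(i32::try_from(input.len()).expect("run end exceeds i32"));
    }
    (run_ends, values)
}

/// Cast only the values of an encoded array; the run ends are reused unchanged.
fn cast_values_to_string(run_ends: &[i32], values: &[i64]) -> (Vec<i32>, Vec<String>) {
    (run_ends.to_vec(), values.iter().map(|v| v.to_string()).collect())
}

fn main() {
    // Casting TO REE: six logical values become three runs.
    let (run_ends, values) = run_end_encode(&[7_i64, 7, 7, 9, 9, 4]);
    assert_eq!(run_ends, vec![3, 5, 6]);
    assert_eq!(values, vec![7, 9, 4]);

    // Casting FROM REE (value cast): the run structure is preserved.
    let (same_ends, strings) = cast_values_to_string(&run_ends, &values);
    assert_eq!(same_ends, run_ends);
    assert_eq!(strings, vec!["7", "9", "4"]);
}
```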
   
   ## Tradeoffs and Implementation
   The implementation of REE array casting introduced a critical tradeoff 
between user flexibility and data integrity.
   
   Unlike most Arrow types, REE arrays have a fundamental monotonicity 
constraint: their run-end indices must be strictly increasing to preserve 
logical correctness. Silent truncation or wrapping during downcasts (e.g., 
Int64 → Int16) could produce invalid sequences like:
   
   `[1000, -15536, 14464]` // due to overflow
   
   Such sequences break the REE invariant and could cause panics or silent data 
corruption downstream. Arrow’s `CastOptions` normally let the caller use the 
`safe` flag to replace values that fail to cast with nulls instead of 
returning an error. However, for REE arrays, such behavior is unsafe: null or 
wrapped run ends are incompatible with the strict invariants required by 
run-end indices.
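
The overflow case can be reproduced with plain integers. The sketch below uses only the standard library; the input `[1000, 50000, 80000]` is an assumed example that happens to wrap to the sequence shown above, and `wrapping_narrow` and `is_strictly_increasing` are hypothetical helper names, not functions from this PR.

```rust
/// Narrow with `as`, which silently keeps only the low 16 bits of each value.
fn wrapping_narrow(run_ends: &[i64]) -> Vec<i16> {
    run_ends.iter().map(|&v| v as i16).collect()
}

/// The monotonicity constraint: each run end must be greater than the previous one.
fn is_strictly_increasing(run_ends: &[i16]) -> bool {
    run_ends.windows(2).all(|w| w[0] < w[1])
}

fn main() {
    // Valid Int64 run ends; only the first one fits in Int16.
    let original = [1000_i64, 50_000, 80_000];

    let wrapped = wrapping_narrow(&original);
    assert_eq!(wrapped, vec![1000, -15536, 14464]);

    // The wrapped sequence is no longer a valid set of run ends.
    assert!(!is_strictly_increasing(&wrapped));
}
```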
   
   We chose to hard-code checked, error-on-overflow behavior for run-end 
casting, regardless of the `safe` setting in `CastOptions`.
   
   This ensures that:
   
   * Narrowing conversions of run-end indices (e.g., Int64 → Int16) fail 
immediately if any value exceeds the target type’s bounds, whatever `safe` 
value the user sets.
   * Widening conversions (e.g., Int16 → Int32 → Int64) are always allowed, as 
they are lossless.
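
As a rough illustration of this policy (a sketch under assumptions, not the PR's implementation), the following standard-library Rust models run ends as plain slices. `narrow_run_ends` and `widen_run_ends` are hypothetical names, and the `_safe` parameter is a stand-in for the caller's `CastOptions` flag, which is deliberately ignored.

```rust
use std::convert::TryFrom;

/// Narrow Int64 run ends to Int16 with a checked conversion.
/// `_safe` is a hypothetical stand-in for the caller's flag and is ignored:
/// any value that does not fit is an error, never a null or a wrapped value.
fn narrow_run_ends(run_ends: &[i64], _safe: bool) -> Result<Vec<i16>, String> {
    run_ends
        .iter()
        .map(|&v| {
            i16::try_from(v).map_err(|_| format!("run end {} does not fit in Int16", v))
        })
        .collect()
}

/// Widening never loses information, so it cannot fail.
fn widen_run_ends(run_ends: &[i16]) -> Vec<i64> {
    run_ends.iter().map(|&v| i64::from(v)).collect()
}

fn main() {
    // In-range run ends narrow without error.
    assert_eq!(narrow_run_ends(&[3, 5, 6], true), Ok(vec![3, 5, 6]));

    // Out-of-range run ends fail for either value of the flag.
    for &safe in &[true, false] {
        assert!(narrow_run_ends(&[1000, 50_000, 80_000], safe).is_err());
    }

    // Widening round-trips the values exactly.
    assert_eq!(widen_run_ends(&[3, 5, 6]), vec![3_i64, 5, 6]);
}
```

Returning an error on the first out-of-range run end keeps the result all-or-nothing, which matches the intent that a cast never yields a partially valid REE array.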
   
   **This policy protects the logical soundness of REE arrays and maintains 
integrity across the Arrow ecosystem.**
   
   # What changes are included in this PR?
   Users can now cast RunEndEncoded arrays using the standard 
`arrow_cast::cast()` function:
   1. `run_end_encoded_cast()`: Casts values within existing RunEndEncoded 
arrays to different types
   2. `cast_to_run_end_encoded()`: Converts regular arrays to RunEndEncoded 
format with run-end encoding
   3. Updated `can_cast_types()` to support RunEndEncoded compatibility rules. 
Downcasting the run-end type is not allowed.
   
   # Are these changes tested?
   Yes!
   
   # Are there any user-facing changes?
   
   No breaking changes, just new functionality

