vegarsti opened a new pull request, #8589: URL: https://github.com/apache/arrow-rs/pull/8589
# Which issue does this PR close?

- Contributes towards the RunEndEncoded (REE) epic #3520, but there is no specific issue for casting.

# Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache Arrow. RunEndEncoded is a compression format that stores consecutive runs of equal values efficiently, but it previously lacked casting functionality.

This change enables:

1. Casting FROM RunEndEncoded arrays - Converting the values within a RunEndEncoded array to different data types while preserving the run structure
2. Casting TO RunEndEncoded arrays - Converting regular arrays into RunEndEncoded format by performing run-end encoding
3. Full integration with Arrow's casting system - Making RunEndEncoded arrays work with the existing `cast()` and `can_cast_types()` functions

## Tradeoffs and Implementation

The implementation of REE array casting introduced a critical tradeoff between user flexibility and data integrity. Unlike most Arrow types, REE arrays have a fundamental monotonicity constraint: their run-end indices must be strictly increasing to preserve logical correctness. Silent truncation or wrapping during downcasts (e.g., Int64 → Int16) could produce invalid sequences like:

`[1000, -15536, 14464]` // due to overflow

Such sequences break the REE invariant and could cause panics or silent data corruption downstream.

Arrow's `CastOptions` normally allow `safe = false` to skip overflow checks and instead produce nulls. However, for REE arrays, such behavior is unsafe and incompatible with the strict invariants required by run-end indices. We chose to hard-code `safe = true` behavior for run-end casting.
This ensures that:

* Narrowing conversions of run-end indices (e.g., Int64 → Int16) always fail immediately if any value exceeds the target type's bounds, even if the user explicitly sets `safe = false`.
* Widening conversions (e.g., Int16 → Int32 → Int64) are allowed, as they are inherently lossless.

**This policy protects the logical soundness of REE arrays and maintains integrity across the Arrow ecosystem.**

# What changes are included in this PR?

Users can now cast RunEndEncoded arrays using the standard `arrow_cast::cast()` function.

1. `run_end_encoded_cast()`: Casts values within existing RunEndEncoded arrays to different types
2. `cast_to_run_end_encoded()`: Converts regular arrays to RunEndEncoded format with run-end encoding
3. Updated `can_cast_types()` to support RunEndEncoded compatibility rules. Downcasting of run ends is not allowed.

# Are these changes tested?

Yes!

# Are there any user-facing changes?

No breaking changes, just new functionality.
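To make the run-end overflow concern concrete, here is a minimal sketch that uses only the Rust standard library. It does not call the arrow-rs API and is not code from this PR: `run_end_encode`, `narrow_wrapping`, `narrow_checked`, and `is_strictly_increasing` are hypothetical helpers, and the Int64 run ends `[1000, 50000, 80000]` are assumed inputs chosen because they wrap to the invalid sequence quoted in the Tradeoffs section.

```rust
/// Run-end encode a slice. Returns `(run_ends, values)`, where `run_ends[i]`
/// is the exclusive logical end index of the i-th run.
fn run_end_encode<T: PartialEq + Clone>(input: &[T]) -> (Vec<i64>, Vec<T>) {
    let mut run_ends: Vec<i64> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    for (i, item) in input.iter().enumerate() {
        if values.last() == Some(item) {
            // Same value as the current run: move its end forward.
            *run_ends.last_mut().unwrap() = (i + 1) as i64;
        } else {
            // New value: start a new run that ends after this element.
            run_ends.push((i + 1) as i64);
            values.push(item.clone());
        }
    }
    (run_ends, values)
}

/// Narrow with a plain `as` cast, which silently wraps on overflow.
fn narrow_wrapping(run_ends: &[i64]) -> Vec<i16> {
    run_ends.iter().map(|&e| e as i16).collect()
}

/// Narrow with a range check, failing on the first run end that does not fit.
fn narrow_checked(run_ends: &[i64]) -> Result<Vec<i16>, String> {
    run_ends
        .iter()
        .map(|&e| i16::try_from(e).map_err(|_| format!("run end {} overflows i16", e)))
        .collect()
}

/// The REE invariant on run ends: each one is greater than the previous.
fn is_strictly_increasing(xs: &[i16]) -> bool {
    xs.windows(2).all(|w| w[0] < w[1])
}

fn main() {
    // Encoding: three runs ending at logical indices 2, 3 and 6.
    let (run_ends, values) = run_end_encode(&["a", "a", "b", "c", "c", "c"]);
    assert_eq!(run_ends, vec![2, 3, 6]);
    assert_eq!(values, vec!["a", "b", "c"]);

    // Wrapping narrowing breaks the monotonicity invariant.
    let wide = [1000_i64, 50_000, 80_000];
    let wrapped = narrow_wrapping(&wide);
    assert_eq!(wrapped, vec![1000, -15536, 14464]);
    assert!(!is_strictly_increasing(&wrapped));

    // Checked narrowing rejects the same input but accepts run ends that fit.
    assert!(narrow_checked(&wide).is_err());
    assert_eq!(narrow_checked(&[2, 3, 6]), Ok(vec![2, 3, 6]));
}
```

The checked variant mirrors the policy described in this PR only in spirit: it rejects the whole conversion instead of emitting run ends that are no longer strictly increasing.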
