rich-t-kid-datadog commented on code in PR #7713:
URL: https://github.com/apache/arrow-rs/pull/7713#discussion_r2161826579


##########
arrow-cast/src/cast/run_array.rs:
##########
@@ -0,0 +1,152 @@
+use crate::cast::*;
+
+pub(crate) fn run_end_encoded_cast<K: RunEndIndexType>(
+    array: &dyn Array,
+    to_type: &DataType,
+    cast_options: &CastOptions,
+) -> Result<ArrayRef, ArrowError> {
+    match array.data_type() {
+        DataType::RunEndEncoded(_run_end_field, _values_field) => {
+            let run_array = array
+                .as_any()
+                .downcast_ref::<RunArray<K>>()
+                .ok_or_else(|| ArrowError::CastError("Expected 
RunArray".to_string()))?;
+
+            let values = run_array.values();
+
+            match to_type {
+                // CASE 1: Stay as RunEndEncoded, cast only the values
+                DataType::RunEndEncoded(_target_run_end_field, 
target_value_field) => {

Review Comment:
   actually the first point I made doesnt really hold true since` let 
target_type = DataType::RunEndEncoded(
                   Arc::new(Field::new("run_ends", DataType::Int32, false)),
                   Arc::new(Field::new("values", DataType::Utf8, true)),
               );` is possible but I think the silent wrapping of the run_ends 
buffer shouldn't be allowed



##########
arrow-cast/src/cast/run_array.rs:
##########
@@ -0,0 +1,152 @@
+use crate::cast::*;
+
+pub(crate) fn run_end_encoded_cast<K: RunEndIndexType>(
+    array: &dyn Array,
+    to_type: &DataType,
+    cast_options: &CastOptions,
+) -> Result<ArrayRef, ArrowError> {
+    match array.data_type() {
+        DataType::RunEndEncoded(_run_end_field, _values_field) => {
+            let run_array = array
+                .as_any()
+                .downcast_ref::<RunArray<K>>()
+                .ok_or_else(|| ArrowError::CastError("Expected 
RunArray".to_string()))?;
+
+            let values = run_array.values();
+
+            match to_type {
+                // CASE 1: Stay as RunEndEncoded, cast only the values
+                DataType::RunEndEncoded(_target_run_end_field, 
target_value_field) => {

Review Comment:
   For example
   `run_ends: [1000, 50000, 80000, 1000000]  // type: Int64
   ` 
   `let casted_run_ends: Vec<i16> = int64_run_ends.iter().map(|&x| x as 
i16).collect();`
   This would result in 
   `Original: [1000, 50000, 80000, 1000000]
   `
   `Casted (as i16): [1000, -15536, 14464, -31072]`
   Which wouldnt be a valid run_end array. I dont think we should allow clients 
to do this. a solution could be to only allow upcast to higher bit Ints 



##########
arrow-cast/src/cast/run_array.rs:
##########
@@ -0,0 +1,152 @@
+use crate::cast::*;
+
+pub(crate) fn run_end_encoded_cast<K: RunEndIndexType>(
+    array: &dyn Array,
+    to_type: &DataType,
+    cast_options: &CastOptions,
+) -> Result<ArrayRef, ArrowError> {
+    match array.data_type() {
+        DataType::RunEndEncoded(_run_end_field, _values_field) => {
+            let run_array = array
+                .as_any()
+                .downcast_ref::<RunArray<K>>()
+                .ok_or_else(|| ArrowError::CastError("Expected 
RunArray".to_string()))?;
+
+            let values = run_array.values();
+
+            match to_type {
+                // CASE 1: Stay as RunEndEncoded, cast only the values
+                DataType::RunEndEncoded(_target_run_end_field, 
target_value_field) => {

Review Comment:
   I'm not sure we should support this. The main reason is that the current 
Arrow casting API is designed to accept a single target type, and adding 
support for RunEndEncoded casting would make this implicit and potentially 
confusing.For example, when casting to a RunEndEncoded type today, the call 
looks like: `cast(array, DataType::RunEndEncoded(...))` This fits naturally 
within Arrow’s existing cast(array, to_type) interface. However, if we were to 
support casting between different REE index types (e.g., Int64 to Int16), we 
would effectively need to pass two types: the run-end index type and the value 
type. That would either:
     - Require changes to the API signature, or
     - Encode both types in a way that's less readable and harder to validate.
   since there are only three valid run-end index types (Int16, Int32, Int64), 
which are monotonically increasing in capacity, supporting downcasts (Int64 → 
Int16) introduces issues. Due to Rust’s two’s complement truncation during 
integer casting with **as**, attempting to cast a large run-end value from 
Int64 to Int16 would silently wrap, resulting in a serious logical error on the 
client side.
   -----------------------------
   actually the first point I made doesnt really hold true since let 
target_type = DataType::RunEndEncoded( Arc::new(Field::new("run_ends", 
DataType::Int32, false)), Arc::new(Field::new("values", DataType::Utf8, true)), 
); is possible but I think the silent wrapping of the run_ends buffer shouldn't 
be allowed
   -------------------------
   For example
   run_ends: [1000, 50000, 80000, 1000000]  // type: Int64 
   let casted_run_ends: Vec<i16> = int64_run_ends.iter().map(|&x| x as 
i16).collect();
   This would result in
   Original: [1000, 50000, 80000, 1000000] 
   Casted (as i16): [1000, -15536, 14464, -31072]
   Which wouldnt be a valid run_end array. I dont think we should allow clients 
to do this. a solution could be to only allow upcast to higher bit Ints



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to