rich-t-kid-datadog commented on code in PR #7713:
URL: https://github.com/apache/arrow-rs/pull/7713#discussion_r2161826579
##########
arrow-cast/src/cast/run_array.rs:
##########
@@ -0,0 +1,152 @@
+use crate::cast::*;
+
+pub(crate) fn run_end_encoded_cast<K: RunEndIndexType>(
+ array: &dyn Array,
+ to_type: &DataType,
+ cast_options: &CastOptions,
+) -> Result<ArrayRef, ArrowError> {
+ match array.data_type() {
+ DataType::RunEndEncoded(_run_end_field, _values_field) => {
+ let run_array = array
+ .as_any()
+ .downcast_ref::<RunArray<K>>()
+ .ok_or_else(|| ArrowError::CastError("Expected
RunArray".to_string()))?;
+
+ let values = run_array.values();
+
+ match to_type {
+ // CASE 1: Stay as RunEndEncoded, cast only the values
+ DataType::RunEndEncoded(_target_run_end_field,
target_value_field) => {
Review Comment:
actually the first point I made doesnt really hold true since` let
target_type = DataType::RunEndEncoded(
Arc::new(Field::new("run_ends", DataType::Int32, false)),
Arc::new(Field::new("values", DataType::Utf8, true)),
);` is possible but I think the silent wrapping of the run_ends
buffer shouldn't be allowed
##########
arrow-cast/src/cast/run_array.rs:
##########
@@ -0,0 +1,152 @@
+use crate::cast::*;
+
+pub(crate) fn run_end_encoded_cast<K: RunEndIndexType>(
+ array: &dyn Array,
+ to_type: &DataType,
+ cast_options: &CastOptions,
+) -> Result<ArrayRef, ArrowError> {
+ match array.data_type() {
+ DataType::RunEndEncoded(_run_end_field, _values_field) => {
+ let run_array = array
+ .as_any()
+ .downcast_ref::<RunArray<K>>()
+ .ok_or_else(|| ArrowError::CastError("Expected
RunArray".to_string()))?;
+
+ let values = run_array.values();
+
+ match to_type {
+ // CASE 1: Stay as RunEndEncoded, cast only the values
+ DataType::RunEndEncoded(_target_run_end_field,
target_value_field) => {
Review Comment:
For example
`run_ends: [1000, 50000, 80000, 1000000] // type: Int64
`
`let casted_run_ends: Vec<i16> = int64_run_ends.iter().map(|&x| x as
i16).collect();`
This would result in
`Original: [1000, 50000, 80000, 1000000]
`
`Casted (as i16): [1000, -15536, 14464, -31072]`
Which wouldnt be a valid run_end array. I dont think we should allow clients
to do this. a solution could be to only allow upcast to higher bit Ints
##########
arrow-cast/src/cast/run_array.rs:
##########
@@ -0,0 +1,152 @@
+use crate::cast::*;
+
+pub(crate) fn run_end_encoded_cast<K: RunEndIndexType>(
+ array: &dyn Array,
+ to_type: &DataType,
+ cast_options: &CastOptions,
+) -> Result<ArrayRef, ArrowError> {
+ match array.data_type() {
+ DataType::RunEndEncoded(_run_end_field, _values_field) => {
+ let run_array = array
+ .as_any()
+ .downcast_ref::<RunArray<K>>()
+ .ok_or_else(|| ArrowError::CastError("Expected
RunArray".to_string()))?;
+
+ let values = run_array.values();
+
+ match to_type {
+ // CASE 1: Stay as RunEndEncoded, cast only the values
+ DataType::RunEndEncoded(_target_run_end_field,
target_value_field) => {
Review Comment:
I'm not sure we should support this. The main reason is that the current
Arrow casting API is designed to accept a single target type, and adding
support for RunEndEncoded casting would make this implicit and potentially
confusing.For example, when casting to a RunEndEncoded type today, the call
looks like: `cast(array, DataType::RunEndEncoded(...))` This fits naturally
within Arrow’s existing cast(array, to_type) interface. However, if we were to
support casting between different REE index types (e.g., Int64 to Int16), we
would effectively need to pass two types: the run-end index type and the value
type. That would either:
- Require changes to the API signature, or
- Encode both types in a way that's less readable and harder to validate.
since there are only three valid run-end index types (Int16, Int32, Int64),
which are monotonically increasing in capacity, supporting downcasts (Int64 →
Int16) introduces issues. Due to Rust’s two’s complement truncation during
integer casting with **as**, attempting to cast a large run-end value from
Int64 to Int16 would silently wrap, resulting in a serious logical error on the
client side.
-----------------------------
actually the first point I made doesnt really hold true since let
target_type = DataType::RunEndEncoded( Arc::new(Field::new("run_ends",
DataType::Int32, false)), Arc::new(Field::new("values", DataType::Utf8, true)),
); is possible but I think the silent wrapping of the run_ends buffer shouldn't
be allowed
-------------------------
For example
run_ends: [1000, 50000, 80000, 1000000] // type: Int64
let casted_run_ends: Vec<i16> = int64_run_ends.iter().map(|&x| x as
i16).collect();
This would result in
Original: [1000, 50000, 80000, 1000000]
Casted (as i16): [1000, -15536, 14464, -31072]
Which wouldnt be a valid run_end array. I dont think we should allow clients
to do this. a solution could be to only allow upcast to higher bit Ints
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]