alamb commented on code in PR #9971:
URL: https://github.com/apache/arrow-datafusion/pull/9971#discussion_r1554556162


##########
datafusion/functions/src/string/lower.rs:
##########
@@ -62,6 +62,6 @@ impl ScalarUDFImpl for LowerFunc {
     }
 
     fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
-        handle(args, |string| string.to_lowercase(), "lower")
+        case_conversion(args, |string| string.to_lowercase(), "lower")

Review Comment:
   It seems like to_lowercase actually works on unicode values, so it may 
change the length of the strings: 
https://doc.rust-lang.org/std/primitive.str.html#method.to_lowercase



##########
datafusion/functions/src/string/common.rs:
##########
@@ -103,6 +104,7 @@ pub(crate) fn general_trim<T: OffsetSizeTrait>(
 /// This function errors when:
 /// * the number of arguments is not 1
 /// * the first argument is not castable to a `GenericStringArray`
+#[allow(dead_code)]

Review Comment:
   Are these functions still needed? Can we instead simply delete them if they 
are no longer used (since they are crate private)?
   
   They don't appear to be used anywhere else: 
https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20unary_string_function&type=code



##########
datafusion/functions/src/string/upper.rs:
##########
@@ -59,6 +59,49 @@ impl ScalarUDFImpl for UpperFunc {
     }
 
     fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
-        handle(args, |string| string.to_uppercase(), "upper")
+        case_conversion(args, |string| string.to_uppercase(), "upper")
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use arrow::array::{ArrayRef, StringArray};
+    use std::sync::Arc;
+
+    #[test]
+    fn upper() -> Result<()> {
+        let string_array = Arc::new(StringArray::from(vec![
+            Some("arrow"),
+            None,
+            Some("datafusion"),

Review Comment:
   Can you please also add tests for non ascii characters whose upper/lowercase 
lengths are different?
   
   It seems like this is possible
   
https://stackoverflow.com/questions/48351343/can-lowercasing-a-utf-8-string-cause-it-to-grow



##########
datafusion/functions/src/string/common.rs:
##########
@@ -174,3 +177,62 @@ where
         },
     }
 }
+
+pub(crate) fn case_conversion<'a, F>(
+    args: &'a [ColumnarValue],
+    op: F,
+    name: &str,
+) -> Result<ColumnarValue>
+where
+    F: Fn(&'a str) -> String,
+{
+    match &args[0] {
+        ColumnarValue::Array(array) => match array.data_type() {
+            DataType::Utf8 => {
+                Ok(ColumnarValue::Array(convert_array::<i32, _>(array, op)?))
+            }
+            DataType::LargeUtf8 => {
+                Ok(ColumnarValue::Array(convert_array::<i64, _>(array, op)?))
+            }
+            other => exec_err!("Unsupported data type {other:?} for function 
{name}"),
+        },
+        ColumnarValue::Scalar(scalar) => match scalar {
+            ScalarValue::Utf8(a) => {
+                let result = a.as_ref().map(|x| op(x));
+                Ok(ColumnarValue::Scalar(ScalarValue::Utf8(result)))
+            }
+            ScalarValue::LargeUtf8(a) => {
+                let result = a.as_ref().map(|x| op(x));
+                Ok(ColumnarValue::Scalar(ScalarValue::LargeUtf8(result)))
+            }
+            other => exec_err!("Unsupported data type {other:?} for function 
{name}"),
+        },
+    }
+}
+
+fn convert_array<'a, O, F>(array: &'a ArrayRef, op: F) -> Result<ArrayRef>
+where
+    O: OffsetSizeTrait,
+    F: Fn(&'a str) -> String,
+{
+    let string_array = as_generic_string_array::<O>(array)?;
+    let value_data = string_array.value_data();
+
+    // SAFETY: all items stored in value_data satisfy UTF8.
+    // ref: impl ByteArrayNativeType for str {...}
+    let str_values = unsafe { std::str::from_utf8_unchecked(value_data) };
+
+    // conversion
+    let converted_values = op(str_values);
+    let bytes = converted_values.into_bytes();
+
+    // build result
+    let values = Buffer::from_vec(bytes);
+    let offsets = string_array.offsets().clone();
+    let nulls = string_array.nulls().cloned();
+
+    // SAFETY: offsets and nulls are consistent with the input array.

Review Comment:
   I think the safety of this also relies on `op` not decreasing the length of 
the characters, as in that case the offsets would be pointing into invalid 
characters. For example if `op()` always returned `""` I think this would 
result in an invalid array.
   
   Can you please also add a sanity check that the the new string didn't get 
smaller?
   
   Something like (untested) this above:
   ```rust
       let orig_len = str_values.len()
       let converted_values = op(str_values);
       // offsets may refer to anywhere in str_values so new string must be at 
least as long
       assert!(converted_values.len() >= orig_len);



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to