Re: [PR] fix: Use codepoints in `lpad`, `rpad`, `translate` [datafusion]

via GitHub Tue, 07 Apr 2026 00:27:02 -0700


martin-g commented on code in PR #21405:
URL: https://github.com/apache/datafusion/pull/21405#discussion_r3043352149



##########
datafusion/functions/src/unicode/lpad.rs:
##########
@@ -270,22 +269,19 @@ fn lpad_scalar_unicode<'a, V: StringArrayType<'a> + Copy, 
T: OffsetSizeTrait>(
     let data_capacity = string_array.len().saturating_mul(target_len * 4);
     let mut builder =
         GenericStringBuilder::<T>::with_capacity(string_array.len(), 
data_capacity);
-    let mut graphemes_buf = Vec::new();
 
     for maybe_string in string_array.iter() {
         match maybe_string {
             Some(string) => {
-                graphemes_buf.clear();
-                graphemes_buf.extend(string.graphemes(true));
+                let char_count = string.chars().count();
 
-                if target_len < graphemes_buf.len() {
-                    let end: usize =
-                        graphemes_buf[..target_len].iter().map(|g| 
g.len()).sum();
-                    builder.append_value(&string[..end]);
+                if target_len < char_count {
+                    builder
+                        .append_value(&string[..byte_offset_of_char(string, 
target_len)]);

Review Comment:
   Both `byte_offset_of_char(string, ...)` and `let char_count = 
string.chars().count();` iterate over the chars in `string`.
   It could be optimized to iterate just once with something like:
   ```rust
   let mut char_count = 0;
   let mut truncate_offset = None;
   for (i, (byte_idx, _)) in string.char_indices().enumerate() {
       if i == target_len {
           truncate_offset = Some(byte_idx);
       }
       char_count += 1;
   }
   if let Some(offset) = truncate_offset {
       builder.append_value(&string[..offset]);
       ... 
   ```



##########
datafusion/sqllogictest/test_files/string/string_literal.slt:
##########
@@ -1829,3 +1886,27 @@ query T
 SELECT arrow_typeof(translate(arrow_cast('12345', 'Utf8View'), '143', 'ax'))
 ----
 Utf8View
+
+# translate operates on Unicode codepoints, not grapheme clusters.
+# chr(769) is U+0301 COMBINING ACUTE ACCENT.
+
+# Replacing a combining accent (a single codepoint) with another character.
+# 'e' || chr(769) is 2 codepoints; translating chr(769) → 'X' replaces just 
the accent.
+query B
+SELECT translate('e' || chr(769), chr(769), 'X') = 'eX'
+----
+true
+
+# Replacing the base character but not the combining accent.
+query B
+SELECT translate('e' || chr(769) || 'y', 'e', 'a') = 'a' || chr(769) || 'y'
+----
+true
+
+# Deleting a combining accent (from longer than to).
+# 'e' || chr(769) || 'x' with chr(769) in from but no corresponding to entry → 
deleted.

Review Comment:
   ```suggestion
   # 'e' || chr(769) || 'x' with chr(769) in `from` but no corresponding `to` 
entry → deleted.
   ```



##########
datafusion/functions/src/unicode/rpad.rs:
##########
@@ -271,22 +270,19 @@ fn rpad_scalar_unicode<'a, V: StringArrayType<'a> + Copy, 
T: OffsetSizeTrait>(
     let data_capacity = string_array.len().saturating_mul(target_len * 4);
     let mut builder =
         GenericStringBuilder::<T>::with_capacity(string_array.len(), 
data_capacity);
-    let mut graphemes_buf = Vec::new();
 
     for maybe_string in string_array.iter() {
         match maybe_string {
             Some(string) => {
-                graphemes_buf.clear();
-                graphemes_buf.extend(string.graphemes(true));
+                let char_count = string.chars().count();
 
-                if target_len < graphemes_buf.len() {
-                    let end: usize =
-                        graphemes_buf[..target_len].iter().map(|g| 
g.len()).sum();
-                    builder.append_value(&string[..end]);
+                if target_len < char_count {
+                    builder
+                        .append_value(&string[..byte_offset_of_char(string, 
target_len)]);

Review Comment:
   Similar to lpad - two passes over the string when truncation is needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: Use codepoints in `lpad`, `rpad`, `translate` [datafusion]

Reply via email to