parthchandra commented on code in PR #3922:
URL: https://github.com/apache/datafusion-comet/pull/3922#discussion_r3074861178
##########
native/spark-expr/src/conversion_funcs/string.rs:
##########
@@ -446,16 +480,30 @@ fn parse_string_to_decimal(input_str: &str, precision: u8, scale: i8) -> SparkRe
let mut start = 0;
let mut end = string_bytes.len();
- // trim whitespaces
- while start < end && string_bytes[start].is_ascii_whitespace() {
+ // Trim ASCII whitespace and null bytes from both ends. Spark's UTF8String
+ // trims null bytes the same way it trims whitespace: "123\u0000" and
+ // "\u0000123" both parse as 123. Null bytes in the middle are not trimmed
+ // and will fail the digit validation in parse_decimal_str, producing NULL.
+ while start < end && (string_bytes[start].is_ascii_whitespace() || string_bytes[start] == 0) {
start += 1;
}
- while end > start && string_bytes[end - 1].is_ascii_whitespace() {
+ while end > start && (string_bytes[end - 1].is_ascii_whitespace() || string_bytes[end - 1] == 0)
+ {
end -= 1;
}
let trimmed = &input_str[start..end];
+ // Normalize fullwidth digits to ASCII. Fast path skips the allocation for
+ // pure-ASCII strings, which is the common case.
+ let normalized;
+ let trimmed = if trimmed.bytes().any(|b| b > 0x7F) {
Review Comment:
The previous loops only move the start and end indices, so they never iterate
over the non-whitespace characters. This check scans the middle of the string
as well. Not sure whether this can be improved.
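A minimal sketch of one possible way to keep that scan cheap, under stated assumptions. `str::is_ascii` is a standard-library method that answers the same question as `bytes().any(|b| b > 0x7F)` and is typically implemented to check several bytes at a time. Returning `Cow<str>` lets the ASCII path borrow the input without allocating. The helper name `normalize_fullwidth_digits` is hypothetical and not taken from the PR, and mapping U+FF10..=U+FF19 to `'0'..='9'` is an assumption about what the normalization does.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not from the PR): map fullwidth digits
/// U+FF10..=U+FF19 to their ASCII equivalents '0'..='9'.
fn normalize_fullwidth_digits(s: &str) -> Cow<'_, str> {
    // Fast path: pure-ASCII input is returned borrowed, with no allocation.
    if s.is_ascii() {
        return Cow::Borrowed(s);
    }
    // Slow path: rebuild the string, replacing only fullwidth digits and
    // leaving every other character untouched.
    let mapped: String = s
        .chars()
        .map(|c| match c {
            '\u{FF10}'..='\u{FF19}' => char::from(b'0' + (c as u32 - 0xFF10) as u8),
            other => other,
        })
        .collect();
    Cow::Owned(mapped)
}
```

The scan over the middle of the string still happens once in this sketch; it only avoids a second pass and the allocation when the input is already ASCII.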
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]