randolf-scholz commented on issue #34976: URL: https://github.com/apache/arrow/issues/34976#issuecomment-1503618134
@jorisvandenbossche

> As you show, we already have the utf8_is_numeric kernel, but the problem is that this is too simplistic and fails if there is a "+" or "-" in the string, or a thousands separate, ..?

`utf8_is_numeric` apparently checks whether each character is marked as belonging to some numeric category by Unicode:

https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc#L144-L151

As my example shows, this is pretty much unrelated to whether a string can be cast to float: it returns `False` for the string "1.0", because "." is not a character that Unicode classifies as numeric.

Therefore, it would be nice to have a utility function `utf8_is_float` that returns `True` if the string can be cast to `float`. With coercion alone, it is slightly annoying to figure out for which non-null items the conversion failed, because we have to do some mask arithmetic:

```python
conversion_failed_mask = pa.compute.and_(
    array.cast(pa.float32(), errors="coerce").is_null(),
    pa.compute.invert(array.is_null())
)
non_float_items = array.filter(conversion_failed_mask)
```

vs.

```python
non_float_items = array.filter(pa.compute.invert(pa.compute.utf8_is_float(array)))
```
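
To make the distinction concrete without depending on Arrow, here is a minimal pure-Python sketch. It assumes that Python's built-in `str.isnumeric` is a reasonable stand-in for a Unicode-category check like the one `utf8_is_numeric` performs, and that `float()` is a reasonable stand-in for a string-to-float cast; Arrow's actual parsing rules may differ (for example around whitespace, `"inf"`, or `"nan"`). The helper `can_cast_to_float` is hypothetical and only illustrates what a `utf8_is_float`-style predicate would report.

```python
def can_cast_to_float(text):
    """Hypothetical helper: True if Python's float() accepts the string."""
    try:
        float(text)
    except ValueError:
        return False
    return True


samples = ["123", "1.0", "-5", "½", "abc", None]

# Unicode-category check: True only if every character is numeric per Unicode.
# Nulls (None) are propagated as None.
numeric_flags = [s.isnumeric() if s is not None else None for s in samples]

# Castability check: True if the string parses as a float; nulls stay None.
float_flags = [can_cast_to_float(s) if s is not None else None for s in samples]

# Non-null items whose conversion would fail, obtained without mask arithmetic.
non_float_items = [s for s, ok in zip(samples, float_flags) if ok is False]

for s, n, f in zip(samples, numeric_flags, float_flags):
    print(f"{s!r:>6}  isnumeric={n}  castable={f}")
```

The two predicates disagree in both directions: "1.0" and "-5" are castable but not "numeric", while "½" is "numeric" but not castable. That is why a dedicated castability predicate would be more useful here than the existing kernel.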
