randolf-scholz commented on issue #34976:
URL: https://github.com/apache/arrow/issues/34976#issuecomment-1503618134

   @jorisvandenbossche 
   
   > As you show, we already have the utf8_is_numeric kernel, but the problem 
is that this is too simplistic and fails if there is a "+" or "-" in the 
string, or a thousands separate, ..?
   
   `utf8_is_numeric` apparently checks whether each character is marked as 
belonging to one of Unicode's numeric categories:
   
   
https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/cpp/src/arrow/compute/kernels/scalar_string_utf8.cc#L144-L151
   
   As my example shows, this is pretty much unrelated to whether a string can 
be cast to float (it returns `False` for the string "1.0", because "." is 
not a character classified as numeric by Unicode). Therefore, it would be nice 
to have a utility function `utf8_is_float` that returns `True` if the string 
can be cast to `float`. With coercion, it is slightly annoying to figure out 
for which non-null items the conversion failed, because we have to do some mask 
arithmetic:
   
   ```python
   conversion_failed_mask = pa.compute.and_(
       array.cast(pa.float32(), errors="coerce").is_null(),
       pa.compute.invert(array.is_null())
   )
   non_float_items = array.filter(conversion_failed_mask)
   ```
   
   vs.
   
   ```python
   non_float_items = array.filter(pa.compute.invert(pa.compute.utf8_is_float(array)))
   ```
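
To make the mismatch between "numeric by Unicode" and "castable to float" concrete, here is a minimal stdlib-only sketch (it is not Arrow code). It assumes the kernel accepts characters in the Unicode general categories Nd, Nl and No, which is my reading of the linked lines. The helpers `is_numeric_by_category` and `can_cast_to_float` are hypothetical names, and Python's `float()` is only a stand-in for Arrow's string-to-float cast, whose accepted syntax may differ.

```python
import unicodedata

# Assumption: these are the general categories treated as "numeric".
NUMERIC_CATEGORIES = {"Nd", "Nl", "No"}


def is_numeric_by_category(s: str) -> bool:
    """True if s is non-empty and every character is in a numeric category."""
    return len(s) > 0 and all(
        unicodedata.category(ch) in NUMERIC_CATEGORIES for ch in s
    )


def can_cast_to_float(s: str) -> bool:
    """True if Python's float() parses s (a stand-in for the Arrow cast)."""
    try:
        float(s)
    except ValueError:
        return False
    return True


samples = ["1", "1.0", "-1", "1e3", "½", "abc"]

# Maps each sample to (numeric by category, parseable as float).
results = {s: (is_numeric_by_category(s), can_cast_to_float(s)) for s in samples}

for s, (numeric, castable) in results.items():
    print(f"{s!r:>6}  numeric={numeric!s:<5}  float={castable}")
```

The two checks disagree in both directions: "1.0", "-1" and "1e3" parse as floats but contain non-numeric characters, while "½" is numeric by category but does not parse as a float.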

