rtbs-dev opened a new issue, #48953:
URL: https://github.com/apache/arrow/issues/48953

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Testing out the new span extractor discussed in #44615 (and the associated 
PR). 
   Trying to replicate pandas `findall`, but `extract_regex` only matches the 
first result in each string. This prevents common whitespace tokenization 
patterns from working at all. 
   
   Assuming a string array `s`: 
   
   ```python
   import re

   import pyarrow.compute as pc

   # PyArrow (named capture group `token`)
   patt = r'\b(?<token>(?:\#[\w\d]+)|(?:\w\/\w)|(?:\w[\w\'\d]+))\b'
   arr = pc.extract_regex(s, patt)
   
   # Pandas (same pattern, non-capturing)
   token_patt = re.compile(
       r'\b(?:(?:\#[\w\d]+)|(?:\w\/\w)|(?:\w[\w\'\d]+))\b',
       flags=re.M | re.DOTALL | re.IGNORECASE,
   )
   s.to_pandas().str.findall(token_patt)
   ```
   The pandas example returns all of the tokens from a sample array of 
sentences, while the PyArrow example only returns the first token of each. 
   More interestingly, we would like to use the span extractor (since pandas 
has no equivalent): 
   
   ```python
   # PyArrow
   arr = pc.extract_regex_span(s, patt)
   
   # Pandas
   spans, tokens = zip(*(
       zip(*[
           (m.span(), m[0].lower()) 
           for m in token_patt.finditer(doc)
       ])
       for doc in s.to_pylist()
   ))
   ```
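
For reference, here is a minimal, self-contained sketch of the first-match vs. all-matches difference described above, using only the standard-library `re` module. The sample sentences in `docs` and the helper name `spans_and_tokens` are hypothetical and not part of the original report. It only illustrates the desired per-document output and makes no claim about how PyArrow should implement it.

```python
import re

# Same tokenizer pattern as the pandas variant above (non-capturing groups).
token_patt = re.compile(
    r"\b(?:(?:\#[\w\d]+)|(?:\w\/\w)|(?:\w[\w\'\d]+))\b",
    flags=re.M | re.DOTALL | re.IGNORECASE,
)


def spans_and_tokens(doc):
    """Return ([(start, end), ...], [token, ...]) for every match in doc."""
    matches = list(token_patt.finditer(doc))
    spans = [m.span() for m in matches]
    tokens = [m[0].lower() for m in matches]
    return spans, tokens


# Hypothetical sample data, for illustration only.
docs = ["Hello world", "it's a test"]

for doc in docs:
    first = token_patt.search(doc)       # first match only
    spans, tokens = spans_and_tokens(doc)  # every match ("findall"-style)
    print(repr(first[0]), spans, tokens)
```

Handling each document separately also avoids the nested `zip(*...)` transposition, which can silently drop data if a document has no matches.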
   
   Is this just a limitation of RE2? I know it doesn't support a global `g` 
flag. Interestingly, Google's AI says something along the lines of 
   
   > `extract_regex` (Find All Matches):
   >
   >    This function inherently finds all matches by default (similar to a 
global search) and returns structs with capture groups.
   >    Example: `pyarrow.compute.extract_regex(strings, pattern=r"(\d+)")` 
extracts all sequences of digits.
   
   If that's incorrect (a hallucination), maybe we can add an explicit note to 
the docs that this function is not global, until a solution is available? 
   
   ### Component(s)
   
   Python

