[ 
https://issues.apache.org/jira/browse/ARROW-12889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352042#comment-17352042
 ] 

Dror Speiser commented on ARROW-12889:
--------------------------------------

This is the same issue as ARROW-12774.

Closing...

> [Python] compute.replace_substring_regex sometimes returns incorrect offsets, 
> causing crashes/ub
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12889
>                 URL: https://issues.apache.org/jira/browse/ARROW-12889
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 4.0.0
>         Environment: ubuntu 20.04 or macos catalina running docker engine 
> 20.10.2 and python 3.8.6
>            Reporter: Dror Speiser
>            Priority: Major
>              Labels: compute, pyarrow
>
> I've come across examples where calling 
> `pyarrow.compute.replace_substring_regex` caused a segfault once the result 
> was used. After some experimentation, I found that the problem lies in the 
> offsets buffer of the result.
> Here is a Dockerfile that reproduces the problem in a few lines (though 
> without an immediate crash):
> {code:none}
> FROM python:3.8
> RUN pip install pyarrow
> RUN echo "import pyarrow; \
>     import pyarrow.compute; \
>     options = pyarrow.compute.ReplaceSubstringOptions('a', ''); \
>     values = [''] * 16; \
>     arr = pyarrow.array(values, pyarrow.string()); \
>     res = pyarrow.compute.replace_substring_regex(arr, options=options); \
>     offsets = res.buffers()[1]; \
>     assert any(offset != 0 for offset in offsets[-4:]);" > /test.py
> RUN python /test.py
> {code}
> The Docker image installs pyarrow (4.0.0 at the time of submitting this 
> issue) and then runs Python code that creates an array of 16 empty strings 
> and calls `replace_substring_regex` on it.
> The last 4 bytes of the offsets buffer (representing the last offset) are 
> checked to be non-zero, which fails.
> Everything but the last offset looks fine: the validity buffer, the rest of 
> the offsets, and the data buffer.
> I have more elaborate examples of arrays that return a random value for the 
> last offset, which causes crashes sooner than a last offset of simply 0.
> Another hint that might help: the problem occurs at multiples of 16, i.e. 
> changing 16 to 32, 48, etc. still shows the problem, but other values do 
> not.
>
> When I cloned the latest master, built Arrow, and ran the example, there 
> was no problem. But since I didn't see the issue here on JIRA, I thought I 
> should probably post it. I have no idea if I'm building correctly, and maybe 
> I'm adding a bug to a bug :)
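
For anyone triaging similar reports, here is a minimal sketch (not from the report, and not pyarrow's own validation) of how the offsets of a string array can be sanity-checked. It assumes the standard Arrow layout for the `string` type: length + 1 little-endian int32 offsets, non-decreasing, with the last offset no larger than the data buffer size. The helper names `decode_offsets` and `find_offset_problems` are made up for illustration.

```python
import struct


def decode_offsets(offsets_buf, length):
    """Decode the length + 1 int32 offsets of a string array with `length` values."""
    raw = bytes(offsets_buf)  # accepts bytes or any buffer-protocol object
    needed = 4 * (length + 1)
    if len(raw) < needed:
        raise ValueError(f"offsets buffer has {len(raw)} bytes, need {needed}")
    # "<" = little-endian, "i" = signed 32-bit integer
    return list(struct.unpack(f"<{length + 1}i", raw[:needed]))


def find_offset_problems(offsets, data_size):
    """Return human-readable descriptions of invalid offsets (empty if none)."""
    problems = []
    if offsets and offsets[0] < 0:
        problems.append(f"first offset is negative: {offsets[0]}")
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            problems.append(f"offset {i} decreases: {offsets[i - 1]} -> {offsets[i]}")
    if offsets and offsets[-1] > data_size:
        problems.append(f"last offset {offsets[-1]} exceeds data size {data_size}")
    return problems


# 16 empty strings: 17 offsets, all zero, and an empty data buffer.
healthy = decode_offsets(struct.pack("<17i", *([0] * 17)), 16)
print(find_offset_problems(healthy, 0))  # no problems expected

# Same array, but with a garbage value in the last offset.
corrupt = healthy[:-1] + [123456]
print(find_offset_problems(corrupt, 0))
```

With pyarrow, the raw inputs would presumably come from `res.buffers()[1]` (offsets) and `res.buffers()[2]` (data) of a non-sliced array, as in the reproduction above; that wiring is an assumption and is not exercised here.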



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
