Dror Speiser created ARROW-12889:
------------------------------------
Summary: [Python] compute.replace_substring_regex sometimes
returns incorrect offsets, causing crashes/ub
Key: ARROW-12889
URL: https://issues.apache.org/jira/browse/ARROW-12889
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 4.0.0
Environment: Ubuntu 20.04 or macOS Catalina running Docker Engine
20.10.2 and Python 3.8.6
Reporter: Dror Speiser
I've come across examples where calling
`pyarrow.compute.replace_substring_regex` returned a result that caused a
segfault when it was later used. After some experimentation, I found that the
problem lies in the offsets buffer of the result.
Here is a Dockerfile that reproduces the problem in a few lines (though
without an immediate crash):
{code:java}
FROM python:3.8
RUN pip install pyarrow
RUN echo "import pyarrow; \
import pyarrow.compute; \
options = pyarrow.compute.ReplaceSubstringOptions('a', ''); \
values = [''] * 16; \
arr = pyarrow.array(values, pyarrow.string()); \
res = pyarrow.compute.replace_substring_regex(arr, options=options); \
offsets = res.buffers()[1]; \
assert any(offset != 0 for offset in offsets[-4:]);" > /test.py
RUN python /test.py
{code}
The Docker image installs pyarrow (4.0.0 at the time of submitting this issue)
and then runs Python code that creates an array of 16 empty strings and calls
`replace_substring_regex` on the array.
The offsets buffer's last 4 bytes (representing the last offset) are checked
to be non-zero, which fails.
Everything but the last offset looks fine: the validity buffer, the rest of the
offsets, and the data buffer.
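
For anyone who wants to inspect the offsets more directly than by looking at the
last 4 bytes, here is a minimal sketch of a consistency check. It uses only the
standard library and works on raw bytes. The helper names `decode_offsets` and
`check_offsets` are made up for illustration and are not part of pyarrow. The
sketch assumes 32-bit little-endian offsets (as for `pyarrow.string()` on common
platforms) and an unsliced array (array offset 0).

```python
import struct


def decode_offsets(raw: bytes, length: int) -> list:
    """Decode the first length + 1 little-endian int32 offsets from raw bytes.

    Hypothetical helper: a string array with `length` values needs
    length + 1 offsets, each 4 bytes wide.
    """
    needed = (length + 1) * 4
    if len(raw) < needed:
        raise ValueError(
            f"offsets buffer has {len(raw)} bytes, need at least {needed}"
        )
    return list(struct.unpack_from(f"<{length + 1}i", raw))


def check_offsets(offsets: list, data_size: int) -> list:
    """Return a list of problems; an empty list means the offsets look consistent."""
    problems = []
    if offsets[0] < 0:
        problems.append(f"first offset is negative ({offsets[0]})")
    # Offsets must never decrease from one value to the next.
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            problems.append(
                f"offset {i} ({offsets[i]}) is smaller than "
                f"offset {i - 1} ({offsets[i - 1]})"
            )
    # The last offset must not point past the end of the data buffer.
    if offsets[-1] > data_size:
        problems.append(
            f"last offset ({offsets[-1]}) exceeds data size ({data_size})"
        )
    return problems


# Hand-built example: the strings "ab", "", "c" have offsets 0, 2, 2, 3.
good = decode_offsets(struct.pack("<4i", 0, 2, 2, 3), length=3)
bad = decode_offsets(struct.pack("<4i", 0, 2, 2, 1000), length=3)
print(check_offsets(good, data_size=3))
print(check_offsets(bad, data_size=3))
```

With pyarrow, the raw bytes could presumably be obtained with
`res.buffers()[1].to_pybytes()`, and `length` with `len(res)`. Reading offsets by
index rather than from the end of the buffer avoids relying on the buffer being
exactly `(length + 1) * 4` bytes long.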
I have more elaborate examples in which the last offset of the result is a
random value, which causes crashes sooner than a plain 0 at the end does.
Another hint that might help: the problem occurs when the array length is a
multiple of 16, i.e. changing 16 to 32, 48, etc. still shows the problem, but
other lengths don't.
When I cloned the latest master, built Arrow, and ran the example, there was
no problem. But since I didn't see the issue here on JIRA, I thought I should
probably post it. I have no idea if I'm building correctly, and maybe I'm
adding a bug to a bug :)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)