[
https://issues.apache.org/jira/browse/ARROW-12889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352042#comment-17352042
]
Dror Speiser commented on ARROW-12889:
--------------------------------------
This is the same as ARROW-12774!
Closing...
> [Python] compute.replace_substring_regex sometimes returns incorrect offsets,
> causing crashes/ub
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-12889
> URL: https://issues.apache.org/jira/browse/ARROW-12889
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 4.0.0
> Environment: ubuntu 20.04 or macos catalina running docker engine
> 20.10.2 and python 3.8.6
> Reporter: Dror Speiser
> Priority: Major
> Labels: compute, pyarrow
>
> I've come across examples where the result of
> `pyarrow.compute.replace_substring_regex` caused a segfault once it was
> used. After some experimentation, I found that the problem lies in the
> offsets buffer of the result.
> Here is a docker file that reproduces the problem in a few lines (though
> without an immediate crash):
> {code:java}
> FROM python:3.8
> RUN pip install pyarrow
> RUN echo "import pyarrow; \
> import pyarrow.compute; \
> options = pyarrow.compute.ReplaceSubstringOptions('a', ''); \
> values = [''] * 16; \
> arr = pyarrow.array(values, pyarrow.string()); \
> res = pyarrow.compute.replace_substring_regex(arr, options=options); \
> offsets = res.buffers()[1]; \
> assert any(offset != 0 for offset in offsets[-4:]);" > /test.py
> RUN python /test.py
> {code}
> The docker image installs pyarrow (4.0.0 at the time of submitting this
> issue), and then runs python code which creates an array of 16 empty strings,
> and calls `replace_substring_regex` on the array.
> The offsets buffer's last 4 bytes (representing the last offset) are checked
> to be non-zero, which fails.
> Everything but the last offset looks fine: the valid buffer, the rest of the
> offsets, and the data buffer.
> I have more elaborate examples of arrays which return a random value for the
> last offset, causing crashes sooner than simply 0 at the end.
> Another hint that might help: the problem occurs at multiples of 16, i.e.
> changing 16 to 32, 48, etc. still shows the problem, but other values
> don't.
>
> When I cloned the latest master, built Arrow, and ran the example, there
> was no problem. But since I didn't see the issue here on JIRA, I thought I
> should probably post it. I have no idea if I'm building correctly, and maybe
> I'm adding a bug to a bug :)
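
For anyone checking a similar result, the offsets can be decoded as 32-bit integers rather than compared byte by byte. The following is only a rough sketch using the standard library: `decode_offsets` and `check_offsets` are hypothetical helper names, and it assumes a `pyarrow.string()` array with no slice offset, whose offsets buffer holds `length + 1` little-endian int32 values (the raw bytes could be obtained with `res.buffers()[1].to_pybytes()`). The sample data uses non-empty strings so that a zeroed final offset is distinguishable from a valid one.

```python
import struct


def decode_offsets(raw: bytes, length: int) -> list:
    """Decode the first length + 1 little-endian int32 offsets from raw bytes.

    Buffers may be padded, so only the entries the array needs are read.
    """
    count = length + 1
    if len(raw) < 4 * count:
        raise ValueError("offsets buffer is shorter than length + 1 entries")
    return list(struct.unpack_from(f"<{count}i", raw, 0))


def check_offsets(offsets: list, data_size: int) -> list:
    """Return human-readable descriptions of any invariant violations."""
    problems = []
    if offsets and offsets[0] < 0:
        problems.append(f"offsets[0]={offsets[0]} is negative")
    # Offsets must never decrease from one entry to the next.
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            problems.append(
                f"offsets[{i}]={offsets[i]} < offsets[{i - 1}]={offsets[i - 1]}"
            )
    # The final offset must not point past the end of the data buffer.
    if offsets and offsets[-1] > data_size:
        problems.append(f"offsets[-1]={offsets[-1]} exceeds data size {data_size}")
    return problems


# Simulated buffers for 16 two-byte strings (assumed data, not real output).
good = struct.pack("<17i", *range(0, 34, 2))
# Same buffer with the last offset overwritten by zero.
bad = good[:-4] + struct.pack("<i", 0)

good_problems = check_offsets(decode_offsets(good, 16), data_size=32)
bad_problems = check_offsets(decode_offsets(bad, 16), data_size=32)
print(good_problems)
print(bad_problems)
```

A check like this only flags structurally invalid offsets; it cannot tell whether the replaced strings themselves are correct.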