maartenbreddels opened a new pull request #9000: URL: https://github.com/apache/arrow/pull/9000
Needs a rebase after https://github.com/apache/arrow/pull/8621 is merged.

I totally agree with https://github.com/python/cpython/blob/c9bc290dd6e3994a4ead2a224178bcba86f0c0e4/Objects/sliceobject.c#L252

This was tricky to get right; the main difficulty is in manually dealing with reverse iterators. I therefore added extra guardrails by having the Python unit tests cover a lot of cases. All edge cases detected this way have been translated to the C++ unit test suite, so we could trim the Python tests to reduce pytest execution cost (I added 1 second).

Slicing follows Python's `[start, stop)` inclusive/exclusive semantics, where an index refers to a codeunit (like Python, apparently, though this is badly documented), and negative indices count from the right. `step != 0` is supported, like Python.

The only thing we cannot easily support is something like reversing a string. In Python one can write `s[::-1]` or `s[-1::-1]`, but we don't support empty values with the Option machinery (we model each bound as a C `int64`). To mimic this, we can do `pc.utf8_slice_codeunits(ar, start=-1, stop=-sys.maxsize, step=-1)` (i.e. pass a very large negative value).

For instance, libraries such as Pandas and Vaex can do something like that. I confirmed this works by modifying the unit test like this:

```python
import sys

import pyarrow as pa
import pyarrow.compute as pc
import pytest


@pytest.mark.parametrize('start', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('stop', list(range(-6, 6)) + [None])
@pytest.mark.parametrize('step', [-3, -2, -1, 1, 2, 3])
def test_slice_compatibility(start, stop, step):
    arr = pa.array(["", "𝑓", "𝑓ö", "𝑓öõ", "𝑓öõḍ", "𝑓öõḍš"])
    # Reference result: plain Python slicing, which accepts None bounds.
    expected = pa.array([k.as_py()[start:stop:step] for k in arr])
    # Replace None with a very large value pointing past the relevant end.
    if start is None:
        start = -sys.maxsize if step > 0 else sys.maxsize
    if stop is None:
        stop = sys.maxsize if step > 0 else -sys.maxsize
    result = pc.utf8_slice_codeunits(arr, start=start, stop=stop, step=step)
    assert expected.equals(result)
```

So libraries using this can implement the full Python behavior with this workaround.
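As a minimal sketch of how a calling library might package this workaround, the helper below maps `None` bounds to the large sentinel values described above. The name `fill_slice_defaults` is hypothetical and not part of Arrow; the example only checks the mapping against plain Python string slicing, on the assumption that the kernel clamps out-of-range bounds the same way Python does.

```python
import sys


def fill_slice_defaults(start, stop, step=1):
    """Replace None bounds with large sentinels (hypothetical helper, not an Arrow API).

    For a positive step, a missing start means "from the left end" and a
    missing stop means "to the right end"; for a negative step the roles
    are reversed.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    if start is None:
        start = -sys.maxsize if step > 0 else sys.maxsize
    if stop is None:
        stop = sys.maxsize if step > 0 else -sys.maxsize
    return start, stop, step


# Plain Python strings clamp out-of-range bounds, so the sentinels
# reproduce the None behaviour, including full reversal.
s = "abcdef"
start, stop, step = fill_slice_defaults(None, None, -1)
assert s[start:stop:step] == s[::-1]
```

A library would then forward the returned values as the `start`, `stop` and `step` arguments of the slicing call instead of passing `None`.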