Jorge Leitão created ARROW-14383:
------------------------------------
Summary: [C++] [Python] Does a sliced StructArray roundtrip on the C data interface?
Key: ARROW-14383
URL: https://issues.apache.org/jira/browse/ARROW-14383
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 5.0.0
Reporter: Jorge Leitão
I am struggling to roundtrip a sliced StructArray over the C data interface.
Consider the array:
{code:python}
fields = [
("f1", pyarrow.int32()),
("f2", pyarrow.string()),
]
a = pyarrow.array(
[
{"f1": 1, "f2": "a"},
None,
{"f1": 3, "f2": None},
{"f1": None, "f2": "d"},
{"f1": None, "f2": None},
],
pyarrow.struct(fields),
).slice(1, 2)
{code}
When reading this array from the C data interface, I get:
{code:java}
array: Ffi_ArrowArray {
length: 2,
null_count: 1,
offset: 1,
n_buffers: 1,
n_children: 2,
buffers: 0x00007f61796091c0,
children: 0x00007f6179609280,
dictionary: 0x0000000000000000,
release: Some(
0x00007f617aef2ba0,
),
private_data: 0x00007f617960b3c0,
}
child #0: Ffi_ArrowArray {
length: 5,
null_count: 2,
offset: 0,
n_buffers: 2,
n_children: 0,
buffers: 0x00007f0f49609200,
children: 0x0000000000000000,
dictionary: 0x0000000000000000,
release: Some(
0x00007f0f4aec9ba0,
),
private_data: 0x00007f0f4960b480,
}
child #1: Ffi_ArrowArray {
length: 5,
null_count: 2,
offset: 0,
n_buffers: 3,
n_children: 0,
buffers: 0x00007f0f49609240,
children: 0x0000000000000000,
dictionary: 0x0000000000000000,
release: Some(
0x00007f0f4aec9ba0,
),
private_data: 0x00007f0f4960b540,
}
{code}
This does not seem consistent with what the Python API offers:
{code:python}
print(a.field(0).offset, len(a.field(0))) # 1 2 <- shouldn't it be 0 5?
{code}
Secondly and most importantly, the condition that each child's length must
equal the array's own length is violated (children length is 5, array's length
is 2 in the example above).
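To make the violated condition concrete, here is a minimal sketch in plain Python (no pyarrow) that plugs in the lengths and offsets from the dump above. The `Node` class and both check functions are hypothetical names used only for illustration. The relaxed variant assumes a reading in which each child only needs to cover the parent's `offset + length`; that reading is an assumption, not something confirmed by the text of this report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for the length/offset fields of an ArrowArray."""
    length: int
    offset: int
    children: List["Node"] = field(default_factory=list)


def children_match_strictly(parent: Node) -> bool:
    # The condition as stated in this report: child length == parent length.
    return all(c.length == parent.length for c in parent.children)


def children_cover_parent(parent: Node) -> bool:
    # Assumed relaxed reading: each child must hold at least
    # parent.offset + parent.length logical slots.
    needed = parent.offset + parent.length
    return all(c.length >= needed for c in parent.children)


# Values copied from the dump above: parent (length 2, offset 1),
# two children (length 5, offset 0).
exported = Node(length=2, offset=1, children=[Node(5, 0), Node(5, 0)])

print(children_match_strictly(exported))  # False
print(children_cover_parent(exported))    # True
```

Under the strict reading the exported array is invalid; under the assumed relaxed reading it is valid, and the consumer has to apply the parent's offset to each child.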
One could argue that a consumer MUST slice each child to get the desired
behavior, but that does not roundtrip: when writing the StructArray back
(after consuming it), we would now write
{code}
write child: Ffi_ArrowArray {
length: 2,
null_count: 0,
offset: 1,
n_buffers: 2,
n_children: 0,
buffers: 0x00000000021c8b20,
children: 0x0000000000000008,
dictionary: 0x0000000000000000,
release: Some(
0x00007fb1f8d536c0,
),
private_data: 0x00000000024f0db0,
}
write child: Ffi_ArrowArray {
length: 2,
null_count: 1,
offset: 1,
n_buffers: 3,
n_children: 0,
buffers: 0x00000000024998f0,
children: 0x0000000000000008,
dictionary: 0x0000000000000000,
release: Some(
0x00007fb1f8d536c0,
),
private_data: 0x0000000002499910,
}
Ffi_ArrowArray {
length: 2,
null_count: 1,
offset: 1,
n_buffers: 1,
n_children: 2,
buffers: 0x00000000024f12d0,
children: 0x00000000021c8ae0,
dictionary: 0x0000000000000000,
release: Some(
0x00007fb1f8d536c0,
),
private_data: 0x00000000024999c0,
}
{code}
which is consumed as
{code}
print(b.field(0).offset, len(b.field(0))) # 2 1 <------------ why?
print(b.offset, len(b)) # 1 2 <-- OK
{code}
which causes the check in [this line|https://github.com/apache/arrow/blob/b73af9a1607caa4a04e1a11896aed6669847a4d4/cpp/src/arrow/array/validate.cc#L115] to fail.
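The `2 1` above is what one gets if `field(i)` slices the stored child by the parent's offset and length, which is also consistent with the `1 2` printed for the original array. The sketch below reproduces both results in plain Python. `slice_window` is a hypothetical helper, and the assumption that `field` behaves this way is inferred from the printed values, not taken from the pyarrow source.

```python
from typing import Tuple


def slice_window(offset: int, length: int, start: int, count: int) -> Tuple[int, int]:
    """Slice an (offset, length) window; the result is clamped to the window."""
    if not 0 <= start <= length:
        raise IndexError("slice start out of bounds")
    return offset + start, min(count, length - start)


parent_offset, parent_length = 1, 2

# Original export: children are unsliced (offset 0, length 5).
original = slice_window(0, 5, parent_offset, parent_length)

# After the consumer slices each child AND keeps the parent's offset of 1,
# the children are written with offset 1 and length 2, so the parent's
# offset ends up being applied twice.
roundtripped = slice_window(1, 2, parent_offset, parent_length)

print(original)      # (1, 2)
print(roundtripped)  # (2, 1)
```

In this model the roundtripped child exposes only one slot for a parent of length 2, which matches the inconsistency reported above.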
I was unable to find a test for a roundtrip of a sliced struct [in pyarrow tests|https://github.com/apache/arrow/blob/5ead37593472c42f61c76396dde7dcb8954bde70/python/pyarrow/tests/test_cffi.py]
to compare my test with a reference test, but it seems to me that when we
slice a StructArray, we should slice its children accordingly so that its C
data interface yields a consistent result?