[ https://issues.apache.org/jira/browse/ARROW-14383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437454#comment-17437454 ]
Jorge Leitão commented on ARROW-14383:
--------------------------------------
This behavior (of slicing the child) is also present in `FixedSizeListArray`.
> [C++] [Python] Does a sliced StructArray roundtrip on c data interface?
> -----------------------------------------------------------------------
>
> Key: ARROW-14383
> URL: https://issues.apache.org/jira/browse/ARROW-14383
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 5.0.0
> Reporter: Jorge Leitão
> Priority: Major
>
> I am struggling to roundtrip a sliced StructArray over the C data interface.
> Consider the array:
> {code:python}
> import pyarrow
>
> fields = [
>     ("f1", pyarrow.int32()),
>     ("f2", pyarrow.string()),
> ]
> a = pyarrow.array(
>     [
>         {"f1": 1, "f2": "a"},
>         None,
>         {"f1": 3, "f2": None},
>         {"f1": None, "f2": "d"},
>         {"f1": None, "f2": None},
>     ],
>     pyarrow.struct(fields),
> ).slice(1, 2)
> {code}
> When reading this array from the C data interface, I get:
> {code:java}
> array: Ffi_ArrowArray {
> length: 2,
> null_count: 1,
> offset: 1,
> n_buffers: 1,
> n_children: 2,
> buffers: 0x00007f61796091c0,
> children: 0x00007f6179609280,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007f617aef2ba0,
> ),
> private_data: 0x00007f617960b3c0,
> }
> child #0: Ffi_ArrowArray {
> length: 5,
> null_count: 2,
> offset: 0,
> n_buffers: 2,
> n_children: 0,
> buffers: 0x00007f0f49609200,
> children: 0x0000000000000000,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007f0f4aec9ba0,
> ),
> private_data: 0x00007f0f4960b480,
> }
> child #1: Ffi_ArrowArray {
> length: 5,
> null_count: 2,
> offset: 0,
> n_buffers: 3,
> n_children: 0,
> buffers: 0x00007f0f49609240,
> children: 0x0000000000000000,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007f0f4aec9ba0,
> ),
> private_data: 0x00007f0f4960b540,
> }
> {code}
> This does not seem consistent with what the Python API offers:
> {code:python}
> print(a.field(0).offset, len(a.field(0)))  # 1 2 <- shouldn't it be 0 5? (or better, vice-versa)
> {code}
> Secondly and most importantly, the condition that each child's length must
> equal the array's own length is violated (children length is 5, array's
> length is 2 in the example above).
> We could argue that a consumer MUST slice each child to achieve the desired
> behavior, but that won't roundtrip because, when writing the StructArray
> (after consuming it), we would now write
> {code}
> write child: Ffi_ArrowArray {
> length: 2,
> null_count: 0,
> offset: 1,
> n_buffers: 2,
> n_children: 0,
> buffers: 0x00000000021c8b20,
> children: 0x0000000000000008,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007fb1f8d536c0,
> ),
> private_data: 0x00000000024f0db0,
> }
> write child: Ffi_ArrowArray {
> length: 2,
> null_count: 1,
> offset: 1,
> n_buffers: 3,
> n_children: 0,
> buffers: 0x00000000024998f0,
> children: 0x0000000000000008,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007fb1f8d536c0,
> ),
> private_data: 0x0000000002499910,
> }
> Ffi_ArrowArray {
> length: 2,
> null_count: 1,
> offset: 1,
> n_buffers: 1,
> n_children: 2,
> buffers: 0x00000000024f12d0,
> children: 0x00000000021c8ae0,
> dictionary: 0x0000000000000000,
> release: Some(
> 0x00007fb1f8d536c0,
> ),
> private_data: 0x00000000024999c0,
> }
> {code}
> which is consumed as
> {code}
> print(b.field(0).offset, len(b.field(0))) # 2 1 <------------ why?
> print(b.offset, len(b)) # 1 2 <-- OK
> {code}
> which causes the check in [this
> line|https://github.com/apache/arrow/blob/b73af9a1607caa4a04e1a11896aed6669847a4d4/cpp/src/arrow/array/validate.cc#L115]
> to fail.
> I was unable to find a test for a roundtrip of a sliced struct [in pyarrow
> tests|https://github.com/apache/arrow/blob/5ead37593472c42f61c76396dde7dcb8954bde70/python/pyarrow/tests/test_cffi.py]
> to compare my test with a reference test, but it seems to me that when we
> slice a StructArray, we should slice its children accordingly so that its C
> data interface yields a consistent result?
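
The two offset/length pairs printed in the quoted description (1 2 and 2 1) can be reproduced with a small model. This is a minimal sketch, not pyarrow's implementation: ArrayView, field_view and children_cover_parent are hypothetical names, and the rule it encodes (the parent's offset is added to the child's own offset, and the length is clamped to what the child has left) is an assumption chosen because it matches the numbers reported above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArrayView:
    """Hypothetical stand-in for the two ArrowArray fields discussed above."""
    offset: int
    length: int


def field_view(parent: ArrayView, child: ArrayView) -> ArrayView:
    """View of `child` as seen through `parent`, under the assumed rule.

    Assumption: the parent's offset is added on top of the child's own
    offset, and the length is clamped to what the child still has available.
    """
    offset = child.offset + parent.offset
    length = max(0, min(parent.length, child.length - parent.offset))
    return ArrayView(offset=offset, length=length)


def children_cover_parent(parent: ArrayView, children: List[ArrayView]) -> bool:
    """True if every child can supply `parent.length` slots after slicing."""
    return all(field_view(parent, c).length == parent.length for c in children)


# First dump: the parent is sliced, the children are exported untouched.
exported_parent = ArrayView(offset=1, length=2)
exported_child = ArrayView(offset=0, length=5)

# Second dump: the consumer sliced each child itself, then re-exported
# the struct with the parent's offset still set to 1.
rewritten_parent = ArrayView(offset=1, length=2)
rewritten_child = ArrayView(offset=1, length=2)

a_field = field_view(exported_parent, exported_child)
b_field = field_view(rewritten_parent, rewritten_child)

print(a_field.offset, a_field.length)  # 1 2
print(b_field.offset, b_field.length)  # 2 1
print(children_cover_parent(exported_parent, [exported_child]))    # True
print(children_cover_parent(rewritten_parent, [rewritten_child]))  # False
```

Under that assumption, a consumer that slices the children itself and then re-exports them with the parent's offset still set ends up applying the offset twice, which would explain both the 2 1 pair and the failing length check.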
--
This message was sent by Atlassian Jira
(v8.3.4#803005)