Jorge Leitão created ARROW-14383:
------------------------------------

             Summary: [C++] [Python] Does a sliced StructArray roundtrip on c data interface?
                 Key: ARROW-14383
                 URL: https://issues.apache.org/jira/browse/ARROW-14383
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 5.0.0
            Reporter: Jorge Leitão


I am struggling to roundtrip a sliced StructArray over the C data interface.

Consider the array:

{code:python}
import pyarrow

fields = [
    ("f1", pyarrow.int32()),
    ("f2", pyarrow.string()),
]
# Five-element struct array, sliced to the two elements at indices 1 and 2.
a = pyarrow.array(
    [
        {"f1": 1, "f2": "a"},
        None,
        {"f1": 3, "f2": None},
        {"f1": None, "f2": "d"},
        {"f1": None, "f2": None},
    ],
    pyarrow.struct(fields),
).slice(1, 2)
{code}

When reading this array from the C data interface, I get:

{code:java}
array: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00007f61796091c0,
    children: 0x00007f6179609280,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f617aef2ba0,
    ),
    private_data: 0x00007f617960b3c0,
}

child #0: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00007f0f49609200,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b480,
}

child #1: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00007f0f49609240,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b540,
}
{code}

This does not seem consistent with what the Python API offers:
{code:python}
print(a.field(0).offset, len(a.field(0))) # 1 2 <- shouldn't it be 0 5 (offset 0, length 5)?
{code}

Second, and more importantly, the condition that each child's length must 
equal the array's own length is violated (each child has length 5, while the 
array has length 2 in the example above).
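
To make the two possible readings of that condition concrete, here is a minimal pure-Python sketch. `ArrayInfo` and both check functions are hypothetical names invented for illustration; they only model the `length` and `offset` fields printed above and do not call pyarrow. The looser check (children only need to cover the parent's window) is an assumption about what a producer might intend, not something the issue confirms.

```python
from dataclasses import dataclass


@dataclass
class ArrayInfo:
    """Hypothetical stand-in for the length/offset fields of an ArrowArray."""
    length: int
    offset: int


def children_match_strictly(parent, children):
    # Strict reading: every child has exactly the parent's length.
    return all(child.length == parent.length for child in children)


def children_cover_parent(parent, children):
    # Looser reading (an assumption): each child only has to be long enough
    # to cover the window [parent.offset, parent.offset + parent.length).
    return all(child.length >= parent.offset + parent.length for child in children)


# Values as exported by pyarrow in the dump above.
exported_parent = ArrayInfo(length=2, offset=1)
exported_children = [ArrayInfo(length=5, offset=0), ArrayInfo(length=5, offset=0)]

# Values a consumer would write back after slicing each child itself.
rewritten_parent = ArrayInfo(length=2, offset=1)
rewritten_children = [ArrayInfo(length=2, offset=1), ArrayInfo(length=2, offset=1)]

print(children_match_strictly(exported_parent, exported_children))    # False
print(children_cover_parent(exported_parent, exported_children))      # True
print(children_match_strictly(rewritten_parent, rewritten_children))  # True
print(children_cover_parent(rewritten_parent, rewritten_children))    # False
```

Under these assumptions the exported array passes only the looser check, while an array whose children were sliced by the consumer passes only the strict one, so the two sides cannot both be right at the same time.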

We could argue that a consumer MUST slice each child to achieve the desired 
behavior, but that won't roundtrip because, when writing the StructArray (after 
consuming it), we would now write:

{code}
write child: Ffi_ArrowArray {
    length: 2,
    null_count: 0,
    offset: 1,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00000000021c8b20,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024f0db0,
}
write child: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00000000024998f0,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x0000000002499910,
}
Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00000000024f12d0,
    children: 0x00000000021c8ae0,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024999c0,
}
{code}

which is consumed as:

{code}
print(b.field(0).offset, len(b.field(0))) # 2 1 <------------ why?
print(b.offset, len(b))  # 1 2 <-- OK
{code}
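
One reading that reproduces both printouts (`1 2` for `a` and `2 1` for `b`) is that field access applies the parent's offset and length on top of whatever offset the stored child already has. The sketch below models only that arithmetic in plain Python. `ArrayInfo`, `slice_info` and `field_view` are hypothetical helpers, and the composition rule is an assumption inferred from the numbers in this report, not a statement about pyarrow internals.

```python
from dataclasses import dataclass


@dataclass
class ArrayInfo:
    """Hypothetical stand-in for the length/offset fields of an ArrowArray."""
    length: int
    offset: int


def slice_info(arr, start, count):
    # Slicing keeps the buffers, shifts the offset and clamps the length.
    count = max(0, min(count, arr.length - start))
    return ArrayInfo(length=count, offset=arr.offset + start)


def field_view(parent, child):
    # Assumption: the parent's window is applied to the stored child.
    return slice_info(child, parent.offset, parent.length)


parent = ArrayInfo(length=2, offset=1)

# Child as first exported by pyarrow: unsliced, length 5.
first_export_child = ArrayInfo(length=5, offset=0)
view = field_view(parent, first_export_child)
print(view.offset, view.length)  # 1 2

# Child as written back by a consumer that already sliced it.
rewritten_child = ArrayInfo(length=2, offset=1)
view = field_view(parent, rewritten_child)
print(view.offset, view.length)  # 2 1
```

If this reading holds, the slice is applied twice on the second pass: once by the consumer that sliced the child, and once more through the parent's offset.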

which causes the check in [this 
line|https://github.com/apache/arrow/blob/b73af9a1607caa4a04e1a11896aed6669847a4d4/cpp/src/arrow/array/validate.cc#L115]
 to fail.

I was unable to find a test for a roundtrip of a sliced struct [in pyarrow 
tests|https://github.com/apache/arrow/blob/5ead37593472c42f61c76396dde7dcb8954bde70/python/pyarrow/tests/test_cffi.py]
 to compare my test with a reference test, but it seems to me that when we 
slice a StructArray, we should slice its children accordingly so that its C 
data interface yields a consistent result?


