[jira] [Updated] (ARROW-16131) [C++] Record batch specific metadata is not saved in IPC file

Yue Ni (Jira) Thu, 07 Apr 2022 22:56:04 -0700


     [ https://issues.apache.org/jira/browse/ARROW-16131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yue Ni updated ARROW-16131:
---------------------------
    Description: 
When writing an IPC file that contains multiple record batches, the schema 
provided to `IpcFormatWriter` is correctly written to the IPC file's footer. 
However, if a record batch being written has batch-specific metadata 
associated with it, that metadata is not written.

This can be reproduced with the following test case (using pyarrow):
{code:python}
import pyarrow as pa


def test_chunked_record_batch_meta():
    num_batches = 2
    chunk_size = 10  # undefined in the original report; value assumed here
    ipc_file = "/tmp/batches_with_metadata.arrow"
    int_array = pa.array(list(range(chunk_size)))
    # Schema-level metadata, which is written to the file footer.
    schema = pa.schema(
        [
            ("values", pa.int64()),
        ],
        metadata={"foo": "bar"},
    )
    writer = pa.RecordBatchFileWriter(
        ipc_file, schema
    )
    for i in range(num_batches):
        # follow examples here:
        # https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py
        batch = pa.record_batch(
            [int_array],
            names=["values"],
            metadata={"batch_id": str(i)},  # batch-specific metadata
        )
        writer.write_batch(batch)
    writer.close()
    # Read the file back and inspect the metadata of the first batch.
    mmapped_file = pa.memory_map(ipc_file)
    reader = pa.ipc.open_file(mmapped_file)
    batch_0 = reader.get_record_batch(0)
    assert batch_0.schema.metadata
{code}

  was:
When writing an IPC file having multiple record batches, the schema provided to 
`IpcFormatWriter` is correctly written to IPC file's footer, however, if the 
record batch written has its batch specific metadata associated with it, this 
metadata is not written.

This can be reproduced with the following test case (using pyarrow):
{code:java}
def test_chunked_record_batch_meta():
    num_batches = 2
    ipc_file = "/tmp/batches_with_metadata.arrow"
    int_array = pa.array([i for i in range(chunk_size)])
    schema = pa.schema(
        [
            ("values", pa.int64()),
        ],
        metadata={"foo": "bar"},
    )
    writer = pa.RecordBatchFileWriter(
        ipc_file, schema
    )
    for i in range(num_batches):
        # follow examples here:
        # 
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py
        batch = pa.record_batch(
            [int_array],
            names=["values"],
            metadata={"batch_id": str},
        )
        writer.write_batch(batch)
    writer.close()
    mmapped_file = pa.memory_map(ipc_file)
    reader = pa.ipc.open_file(mmapped_file)
    batch_0 = reader.get_record_batch(0)
    assert batch_0.schema.metadata {code}


> [C++] Record batch specific metadata is not saved in IPC file
> -------------------------------------------------------------
>
>                 Key: ARROW-16131
>                 URL: https://issues.apache.org/jira/browse/ARROW-16131
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 7.0.0
>            Reporter: Yue Ni
>            Assignee: Yue Ni
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
