[
https://issues.apache.org/jira/browse/ARROW-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611673#comment-17611673
]
Steve M. Kim commented on ARROW-16430:
--------------------------------------
ARROW-2022 introduced a related problem with {{Schema}} messages. I am not sure
whether it ought to be tracked in this issue, tracked in a separate issue, or
discussed further on the mailing list.
The current
[documentation|https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata]
for the IPC format says:
{quote}We provide a {{custom_metadata}} field at three levels to provide a
mechanism for developers to pass application-specific metadata in Arrow
protocol messages. This includes {{Field}}, {{Schema}}, and
{{Message}}.
{quote}
Consistent with the documentation, the FlatBuffer definitions have two
different {{custom_metadata}} fields that appear in an encapsulated message of
type Schema:
* The {{custom_metadata}} field within the {{Schema}} table
* The {{custom_metadata}} field within the parent {{Message}} table
Currently, the pyarrow implementation recognizes only the {{custom_metadata}}
field in the {{Schema}} table and is unaware of the {{custom_metadata}} field in
the parent {{Message}} table. The proposed change
[https://github.com/apache/arrow/pull/13041] will use the {{custom_metadata}}
field in the parent {{Message}} table for {{RecordBatch}} messages, but it won't
address this ambiguity with {{Schema}} messages.
I think that it is useful for {{pyarrow.Schema}} and {{pyarrow.RecordBatch}}
objects to carry custom metadata, independent of their IPC message
serialization. I also think that perhaps {{pyarrow.Table}} ought to carry its
own custom metadata that is separate from the metadata of its {{Schema}},
because a {{Table}} is like a {{RecordBatch}}. In the current implementation,
attempting to instantiate a {{Table}} with both a metadata-enriched {{Schema}}
and a separate metadata dict raises
{{ValueError: Cannot pass both schema and metadata}}. This behavior is
inconsistent with the documentation, which distinguishes the metadata of a
{{Schema}} from the metadata of a {{Message}}.
> [Python] Read/Write record batch custom metadata API in pyarrow
> ---------------------------------------------------------------
>
> Key: ARROW-16430
> URL: https://issues.apache.org/jira/browse/ARROW-16430
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 7.0.0
> Reporter: Yue Ni
> Assignee: Yue Ni
> Priority: Major
> Labels: pull-request-available
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/ARROW-16131, Arrow C++ APIs were
> added so that users can read/write record batch custom metadata for IPC files,
> but pyarrow still lacks corresponding APIs for doing this.