[ 
https://issues.apache.org/jira/browse/ARROW-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611673#comment-17611673
 ] 

Steve M. Kim commented on ARROW-16430:
--------------------------------------

ARROW-2022 introduced a related problem with {{Schema}} messages. I am not sure 
whether the related problem ought to be tracked in this issue or in a separate 
issue, or discussed further on the mailing list.

The current 
[documentation|https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata]
 for the IPC format says
{quote}We provide a {{custom_metadata}} field at three levels to provide a 
mechanism for developers to pass application-specific metadata in Arrow 
protocol messages. This includes {{Field}}, {{Schema}}, and {{Message}}.
{quote}
Consistent with the documentation, the FlatBuffer definitions have two 
different {{custom_metadata}} fields that appear in an encapsulated message of 
type {{Schema}}:
* The {{custom_metadata}} field within the {{Schema}} table
* The {{custom_metadata}} field within the parent {{Message}} table
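
To illustrate the two levels listed above, here is a minimal sketch using plain Python dataclasses. The names {{SchemaTable}}, {{MessageTable}}, and {{read_schema_metadata_only}} are hypothetical stand-ins that I made up for this comment; they are not pyarrow classes or the generated FlatBuffer code, and the real {{custom_metadata}} fields are vectors of {{KeyValue}} rather than dicts. The sketch only shows how a reader that looks at one level never sees the other.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical stand-in for the Schema FlatBuffer table (not a pyarrow class).
@dataclass
class SchemaTable:
    field_names: List[str]
    # Simplified to a dict; the real field is a vector of KeyValue pairs.
    custom_metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical stand-in for the parent Message FlatBuffer table.
@dataclass
class MessageTable:
    header: SchemaTable
    custom_metadata: Dict[str, str] = field(default_factory=dict)


def read_schema_metadata_only(message: MessageTable) -> Dict[str, str]:
    """Mimic a reader that consults only Schema.custom_metadata."""
    return dict(message.header.custom_metadata)


# An encapsulated message of type Schema carrying metadata at both levels.
message = MessageTable(
    header=SchemaTable(["x"], {"origin": "schema-level"}),
    custom_metadata={"origin": "message-level"},
)

# The reader sees the Schema-level metadata only.
seen = read_schema_metadata_only(message)
# The Message-level metadata exists in the message but is never surfaced.
unseen = message.custom_metadata
```

With this model, {{seen}} holds only the Schema-level entry, while the Message-level entry in {{unseen}} has no way to reach the caller, which is the ambiguity I am describing.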

 

Currently, the pyarrow implementation recognizes only the {{custom_metadata}} 
field in the {{Schema}} table and is unaware of the {{custom_metadata}} field 
in the parent {{Message}} table. The proposed change 
[https://github.com/apache/arrow/pull/13041] will use the {{custom_metadata}} 
field in the parent {{Message}} table for {{RecordBatch}} messages, but it 
won't address this ambiguity for {{Schema}} messages.

I think that it is useful for {{pyarrow.Schema}} and {{pyarrow.RecordBatch}} 
objects to carry custom metadata, independent of their IPC message 
serialization. I also think that perhaps {{pyarrow.Table}} ought to carry its 
own custom metadata, separate from the metadata of its {{Schema}}, because a 
{{Table}} is like a {{RecordBatch}}. In the current implementation, attempting 
to instantiate a {{Table}} with both a metadata-enriched {{Schema}} and a 
separate metadata dict raises 
{{ValueError: Cannot pass both schema and metadata}}. This behavior is 
inconsistent with the documentation, which distinguishes the metadata of a 
{{Schema}} from the metadata of a {{Message}}.

> [Python] Read/Write record batch custom metadata API in pyarrow
> ---------------------------------------------------------------
>
>                 Key: ARROW-16430
>                 URL: https://issues.apache.org/jira/browse/ARROW-16430
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Yue Ni
>            Assignee: Yue Ni
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> In https://issues.apache.org/jira/browse/ARROW-16131, Arrow C++ APIs were 
> added so that users can read/write record batch custom metadata for IPC file. 
> But pyarrow still lacks corresponding APIs for doing this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
