[ https://issues.apache.org/jira/browse/ARROW-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488916#comment-17488916 ]
Antoine Pitrou commented on ARROW-15613:
----------------------------------------
There is actually a discussion to relax the UTF-8 requirement in IPC metadata
values (see the message recently posted by [~jorisvandenbossche] "Re:
[DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] -
Dynamic schema changes in ArrowFlight streams").
In short: yes, Arrow C++ and PyArrow can put arbitrary binary data in metadata
values.
Also cc [~lidavidm] [~emkornfield]
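
For reference, a minimal sketch that checks the 28 bytes quoted in the issue
below using only the Python standard library. It assumes the byte list was
copied correctly from the report; `is_valid_utf8` is a hypothetical helper
name, not a PyArrow API. The bytes appear to parse as a protocol 3 pickle
stream, which would explain why they are not valid UTF-8 (the leading 0x80
is the pickle PROTO opcode).

```python
import pickletools

# Value bytes for "ARROW:extension:metadata", copied from the issue description.
value = bytes([
    128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100,
    84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46,
])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# List the pickle opcodes without unpickling anything:
# genops only parses the stream, it never imports or executes code.
opcodes = [(op.name, arg) for op, arg, _pos in pickletools.genops(value)]

print(len(value), is_valid_utf8(value))
for name, arg in opcodes:
    print(name, arg)
```

The sketch inspects the stream with `pickletools.genops` rather than calling
`pickle.loads`, since loading would try to import the module named in the
GLOBAL opcode.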
> [C++][Python] Metadata from C data interface is not valid utf8
> --------------------------------------------------------------
>
> Key: ARROW-15613
> URL: https://issues.apache.org/jira/browse/ARROW-15613
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Jorge Leitão
> Priority: Major
>
> While trying to roundtrip an extension from schema.metadata (see ARROW-13855
> for details), I got invalid utf8, which imo goes against
> > A binary string describing the type’s metadata [1]
> Specifically, a field
> field = pyarrow.field("aa", UuidType())
> contains the following:
> ```
> key len: 20
> key: "ARROW:extension:name"
> value len: 23
> value: "arrow.py_extension_type"
> key len: 24
> key: "ARROW:extension:metadata"
> value len: 28
> ```
> with the value's data for this key being:
> ```
> [128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100,
> 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46]
> ```
> This is not valid UTF-8 (see e.g.
> https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=02b67658b3cddf8dc095bc9750fa7032).
> Maybe I am reading the values incorrectly? (null point?)
> [1]
> https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata