[
https://issues.apache.org/jira/browse/ARROW-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jorge Leitão updated ARROW-15613:
---------------------------------
Description:
While trying to round-trip an extension from schema.metadata (see ARROW-13855
for details), I got invalid UTF-8, which imo goes against
> A binary string describing the type’s metadata [1]
Specifically, a field
field = pyarrow.field("aa", UuidType())
contains the following:
```
key len: 20
key: "ARROW:extension:name"
value len: 23
value: "arrow.py_extension_type"
key len: 24
key: "ARROW:extension:metadata"
value len: 28
```
with the value's data for this key being:
```
[128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100, 84,
121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46]
```
This is not valid UTF-8 (see e.g.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=02b67658b3cddf8dc095bc9750fa7032).
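The same check can be reproduced in Python. The sketch below is not part of the original report: it copies the 28 bytes listed above into a hypothetical `payload` variable, tries to decode them as UTF-8, and lists the pickle opcodes with the standard-library `pickletools` module, which only parses the stream and does not import `test_sql`. The leading `0x80 0x03` bytes look like a pickle protocol 3 header, which would explain why decoding fails on the very first byte.

```python
import pickletools

# The 28 bytes reported for the "ARROW:extension:metadata" value.
payload = bytes([
    128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100, 84,
    121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46,
])


def is_valid_utf8(data):
    """Return True if `data` decodes as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# genops only walks the opcode stream; it never unpickles or imports anything.
opcodes = [(op.name, arg) for op, arg, _pos in pickletools.genops(payload)]

if __name__ == "__main__":
    print("valid UTF-8:", is_valid_utf8(payload))
    for name, arg in opcodes:
        print(name, arg)
```

If the opcode listing names `test_sql` and `UuidType`, the value is a pickled reference to the Python extension type rather than text, so a consumer would have to treat it as opaque bytes.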
Maybe I am reading the values incorrectly? (null point?)
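On the question of whether the values are being read incorrectly: the C data interface documentation [1] describes `ArrowSchema.metadata` as a 32-bit pair count followed, for each pair, by a 32-bit key length, the key bytes, a 32-bit value length, and the value bytes, all in native endianness and without null terminators. A minimal Python sketch of that layout, with hypothetical helper names `encode_c_metadata` and `decode_c_metadata` that are not part of pyarrow, could look like this:

```python
import struct


def encode_c_metadata(pairs):
    """Serialize (key, value) byte pairs using the documented metadata layout."""
    out = [struct.pack("=i", len(pairs))]  # number of key/value pairs
    for key, value in pairs:
        out.append(struct.pack("=i", len(key)))
        out.append(key)
        out.append(struct.pack("=i", len(value)))
        out.append(value)
    return b"".join(out)


def decode_c_metadata(buf):
    """Parse a metadata buffer back into a list of (key, value) byte pairs."""
    (n_pairs,) = struct.unpack_from("=i", buf, 0)
    offset = 4
    pairs = []
    for _ in range(n_pairs):
        (key_len,) = struct.unpack_from("=i", buf, offset)
        offset += 4
        key = buf[offset:offset + key_len]
        offset += key_len
        (value_len,) = struct.unpack_from("=i", buf, offset)
        offset += 4
        value = buf[offset:offset + value_len]
        offset += value_len
        pairs.append((key, value))
    return pairs


if __name__ == "__main__":
    example = [(b"ARROW:extension:name", b"arrow.py_extension_type")]
    for key, value in decode_c_metadata(encode_c_metadata(example)):
        print(len(key), key, len(value), value)
```

Because every value carries an explicit length, a reader following this layout recovers the full value even when it contains zero bytes, so the lengths printed in the report (20, 23, 24, 28) are what this layout would produce for those keys and values.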
[1]
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
was:
While trying to round-trip an extension from schema.metadata (see ARROW-13855
for details), I got invalid UTF-8, which imo goes against
> A binary string describing the type’s metadata [1]
Specifically, a field
field = pyarrow.field("aa", UuidType())
contains the following:
```
key len: 20
key: "ARROW:extension:name"
value len: 23
value: "arrow.py_extension_type"
key len: 24
key: "ARROW:extension:metadata"
value len: 28
```
with the values' data:
```
[128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100, 84,
121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46]
```
This is not valid UTF-8 (see e.g.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=02b67658b3cddf8dc095bc9750fa7032).
Maybe I am reading the values incorrectly, but I would expect valid UTF-8 (like
in the IPC format).
[1]
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
> [C++][Python] Metadata from C data interface is not valid utf8
> --------------------------------------------------------------
>
> Key: ARROW-15613
> URL: https://issues.apache.org/jira/browse/ARROW-15613
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Jorge Leitão
> Priority: Major