[ 
https://issues.apache.org/jira/browse/ARROW-15613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-15613:
---------------------------------
    Description: 
While trying to round-trip an extension type through schema.metadata (see ARROW-13855 
for details), I got metadata that is not valid UTF-8, which imo goes against

> A binary string describing the type’s metadata [1]

Specifically, a field

field = pyarrow.field("aa", UuidType())

contains the following:

```
key len: 20
key: "ARROW:extension:name"
value len: 23
value: "arrow.py_extension_type"
key len: 24
key: "ARROW:extension:metadata"
value len: 28
```

with the value's data for this key being:

```
[128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100, 84, 
121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46]
```

This is not valid UTF-8 (see e.g. 
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=02b67658b3cddf8dc095bc9750fa7032).
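
The same check can be sketched in Python using the bytes listed above. The first byte, 128 (0x80), is a UTF-8 continuation byte and cannot start a sequence, so decoding fails. The leading pair 128, 3 also matches the header of a pickle protocol 3 stream, so the value looks like a pickled Python object rather than text. That reading is inferred from the bytes alone. `pickletools.genops` only lists opcodes and does not unpickle or import anything.

```python
import pickletools

# Bytes copied verbatim from the report above (28 bytes).
raw = bytes([128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117,
             105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(len(raw))            # 28, matching "value len: 28"
print(is_valid_utf8(raw))  # False

# Walk the pickle opcodes without executing them.
opcodes = [op.name for op, arg, pos in pickletools.genops(raw)]
global_name = next(
    arg for op, arg, pos in pickletools.genops(raw) if op.name == "GLOBAL"
)
print(opcodes)
print(global_name)         # test_sql UuidType
```

The `GLOBAL` opcode names `test_sql UuidType`, i.e. a module and class, which is consistent with the extension type having been serialized by pickling.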

Maybe I am reading the values incorrectly? (null point?)
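
For reference, the C data interface documentation [1] describes `ArrowSchema.metadata` as length-prefixed and not null-terminated: an int32 pair count, then for each pair an int32 key length, the key bytes, an int32 value length, and the value bytes, all in native byte order. A minimal sketch of that layout, assuming the lengths shown in the listing above; `encode_metadata` and `decode_metadata` are hypothetical helper names, not pyarrow APIs:

```python
import struct


def encode_metadata(pairs):
    """Encode (key, value) byte pairs in the ArrowSchema.metadata layout."""
    out = [struct.pack("=i", len(pairs))]  # int32: number of pairs
    for key, value in pairs:
        out.append(struct.pack("=i", len(key)) + key)      # key length, key
        out.append(struct.pack("=i", len(value)) + value)  # value length, value
    return b"".join(out)


def decode_metadata(buf):
    """Decode the layout back into a list of (key, value) byte pairs."""
    (count,) = struct.unpack_from("=i", buf, 0)
    offset = 4
    pairs = []
    for _ in range(count):
        item = []
        for _ in range(2):  # first the key, then the value
            (length,) = struct.unpack_from("=i", buf, offset)
            offset += 4
            item.append(bytes(buf[offset:offset + length]))
            offset += length
        pairs.append(tuple(item))
    return pairs


# Values from the report; the second value is arbitrary binary data.
pickled = bytes([128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85,
                 117, 105, 100, 84, 121, 112, 101, 10, 113, 0, 41, 82, 113,
                 1, 46])
pairs = [
    (b"ARROW:extension:name", b"arrow.py_extension_type"),
    (b"ARROW:extension:metadata", pickled),
]

decoded = decode_metadata(encode_metadata(pairs))
for key, value in decoded:
    print(len(key), key, len(value))
```

Because every length is explicit, a value may contain any byte, including 0, and a reader that treats values as null-terminated or UTF-8 strings would misread them.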

[1] 
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata

  was:
While trying to round-trip an extension type through schema.metadata (see ARROW-13855 
for details), I got metadata that is not valid UTF-8, which imo goes against

> A binary string describing the type’s metadata [1]

Specifically, a field

field = pyarrow.field("aa", UuidType())

contains the following:

```
key len: 20
key: "ARROW:extension:name"
value len: 23
value: "arrow.py_extension_type"
key len: 24
key: "ARROW:extension:metadata"
value len: 28
```

with the values' data:

```
[128, 3, 99, 116, 101, 115, 116, 95, 115, 113, 108, 10, 85, 117, 105, 100, 84, 
121, 112, 101, 10, 113, 0, 41, 82, 113, 1, 46]
```

This is not valid UTF-8 (see e.g. 
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=02b67658b3cddf8dc095bc9750fa7032).

Maybe I am reading the values incorrectly, but I would expect valid UTF-8 (as 
in the IPC format).

[1] 
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata


> [C++][Python] Metadata from C data interface is not valid utf8
> --------------------------------------------------------------
>
>                 Key: ARROW-15613
>                 URL: https://issues.apache.org/jira/browse/ARROW-15613
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Jorge Leitão
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)