jonded94 opened a new pull request, #45471:
URL: https://github.com/apache/arrow/pull/45471
### Rationale for this change
Currently, `MapScalar`/`MapArray` types are unfortunately not deserialized into
proper Python `dict`s, which breaks "roundtrips" from
Python -> Arrow -> Python:
```
import pyarrow as pa
schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
# [{'x': [('a', 1)]}]
```
This is especially bad when storing TiBs of deeply nested data (think of
lists in structs in maps...) that were created from Python and serialized to
Arrow/Parquet, since the data cannot be read back with native `pyarrow` methods
without extremely ugly and computationally costly workarounds.
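To illustrate the kind of workaround this PR makes unnecessary, here is a hedged pure-Python sketch that recursively rebuilds dicts from the list-of-tuples form that `to_pylist` currently returns for map columns. The helper name `restore_maps` and the assumption that maps always surface as lists of 2-tuples are illustrative, not part of pyarrow's API:

```python
def restore_maps(value):
    """Recursively convert [(k, v), ...] map output back into dicts.

    Illustrative workaround sketch only, not pyarrow code.
    """
    if isinstance(value, list):
        # Heuristic: a non-empty list of 2-tuples is assumed to be a map.
        if value and all(isinstance(item, tuple) and len(item) == 2
                         for item in value):
            return {k: restore_maps(v) for k, v in value}
        return [restore_maps(item) for item in value]
    if isinstance(value, dict):
        return {k: restore_maps(v) for k, v in value.items()}
    return value

rows = [{'x': [('a', 1)], 'y': [[('b', 2)]]}]
print([{k: restore_maps(v) for k, v in row.items()} for row in rows])
# [{'x': {'a': 1}, 'y': [{'b': 2}]}]
```

Besides being verbose, such a post-processing pass has to walk every value of every row a second time, which is what makes it computationally costly at TiB scale.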
### What changes are included in this PR?
A new parameter, `maps_as_pydicts`, is introduced to `to_pylist`, `to_pydict`,
and `as_py`, which allows proper roundtrips:
```
import pyarrow as pa
schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data,
schema=schema).to_pylist(maps_as_pydicts=True)
# [{'x': {'a': 1}}]
```
### Are these changes tested?
Yes. Tests for `to_pylist` and `to_pydict` are included for `pyarrow.Table`,
and the low-level `MapScalar` is also covered, including nesting with
`ListScalar` and `StructScalar`.
Duplicate keys now raise an error, which is also tested.
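The reason duplicate keys must raise is that a plain dict conversion would silently keep only the last value for each key, losing data. The following pure-Python sketch (illustrative only, not pyarrow's internal code; the helper name `pairs_to_dict` is an assumption) shows the check:

```python
def pairs_to_dict(pairs):
    """Convert (key, value) pairs to a dict, rejecting duplicate keys.

    A naive dict(pairs) would silently drop earlier values for a
    repeated key; raising instead surfaces the data loss.
    """
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate map key: {key!r}")
        result[key] = value
    return result

print(pairs_to_dict([('a', 1), ('b', 2)]))  # {'a': 1, 'b': 2}
# pairs_to_dict([('a', 1), ('a', 2)]) raises ValueError
```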
### Are there any user-facing changes?
No call sites should be broken; only a new kwarg is added.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]