jonded94 opened a new pull request, #45471: URL: https://github.com/apache/arrow/pull/45471
### Rationale for this change

Currently, `MapScalar`/`MapArray` values are not deserialized into proper Python `dict`s, which breaks roundtrips from Python -> Arrow -> Python:

```python
import pyarrow as pa

schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
# [{'x': [('a', 1)]}]
```

This is especially painful when storing TiBs of deeply nested data (think of lists in structs in maps...) that were created from Python and serialized into Arrow/Parquet, since the data cannot be read back with native `pyarrow` methods without extremely ugly and computationally costly workarounds.

### What changes are included in this PR?

A new parameter `maps_as_pydicts` is introduced to `to_pylist`, `to_pydict`, and `as_py`, which allows proper roundtrips:

```python
import pyarrow as pa

schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist(maps_as_pydicts=True)
# [{'x': {'a': 1}}]
```

### Are these changes tested?

Yes. Tests for `to_pylist` and `to_pydict` are included for `pyarrow.Table`, and low-level `MapScalar` conversion is tested as well, including nesting with `ListScalar` and `StructScalar`. Duplicate keys now raise an error, which is also tested.

### Are there any user-facing changes?

No call sites should break; a new keyword argument is simply added.
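For illustration, the conversion semantics described above (including the duplicate-key error) can be sketched in plain Python. This is a hypothetical sketch, not pyarrow's actual implementation; the function name `map_entries_to_pydict` is made up for this example:

```python
def map_entries_to_pydict(entries):
    """Convert Arrow map entries, i.e. a list of (key, value) tuples
    (the default `as_py` representation), into a Python dict.

    Hypothetical sketch of the semantics this PR adds; not pyarrow code.
    """
    result = {}
    for key, value in entries:
        if key in result:
            # A dict cannot represent duplicate keys, so the conversion
            # raises instead of silently dropping data.
            raise ValueError(f"duplicate map key: {key!r}")
        result[key] = value
    return result


# The default list-of-tuples representation of a map column:
entries = [('a', 1), ('b', 2)]
print(map_entries_to_pydict(entries))  # {'a': 1, 'b': 2}
```

The key design choice is to fail loudly on duplicate keys rather than keep only the last value, since silently dropping entries would make the "roundtrip" lossy in a different way.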