jonded94 opened a new pull request, #45471:
URL: https://github.com/apache/arrow/pull/45471
### Rationale for this change
Currently, `MapScalar`/`MapArray` types are unfortunately not deserialized into
proper Python `dict`s, which breaks "roundtrips" from
Python -> Arrow -> Python:
```
import pyarrow as pa
schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
# [{'x': [('a', 1)]}]
```
This is especially bad when storing TiBs of deeply nested data (think of
lists in structs in maps...) that were created from Python and serialized to
Arrow/Parquet, since the data cannot be read back with native `pyarrow` methods
without extremely ugly and computationally costly workarounds.
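To illustrate the kind of workaround this PR makes unnecessary, here is a hedged pure-Python sketch that recursively rebuilds dicts from the list-of-tuples form that `to_pylist` currently returns for map columns. The helper name `restore_maps` and the assumption that maps always surface as lists of 2-tuples are illustrative, not part of pyarrow's API:

```python
def restore_maps(value):
    """Recursively convert [(k, v), ...] map output back into dicts.

    Illustrative workaround sketch only, not pyarrow code.
    """
    if isinstance(value, list):
        # Heuristic: a non-empty list of 2-tuples is assumed to be a map.
        if value and all(isinstance(item, tuple) and len(item) == 2
                         for item in value):
            return {k: restore_maps(v) for k, v in value}
        return [restore_maps(item) for item in value]
    if isinstance(value, dict):
        return {k: restore_maps(v) for k, v in value.items()}
    return value

rows = [{'x': [('a', 1)], 'y': [[('b', 2)]]}]
print([{k: restore_maps(v) for k, v in row.items()} for row in rows])
# [{'x': {'a': 1}, 'y': [{'b': 2}]}]
```

Besides being verbose, such a post-processing pass has to walk every value of every row a second time, which is what makes it computationally costly at TiB scale.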
### What changes are included in this PR?
A new parameter, `maps_as_pydicts`, is introduced to `to_pylist`, `to_pydict`,
and `as_py`, which allows proper roundtrips:
```
import pyarrow as pa
schema = pa.schema([pa.field('x', pa.map_(pa.string(), pa.int64()))])
data = [{'x': {'a': 1}}]
pa.RecordBatch.from_pylist(data,
schema=schema).to_pylist(maps_as_pydicts=True)
# [{'x': {'a': 1}}]
```
### Are these changes tested?
Yes. Tests for `to_pylist` and `to_pydict` are included for `pyarrow.Table`,
and the low-level `MapScalar` is also covered, including nesting with
`ListScalar` and `StructScalar`.
Duplicate keys now raise an error, which is also tested.
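The reason duplicate keys must raise is that a plain dict conversion would silently keep only the last value for each key, losing data. The following pure-Python sketch (illustrative only, not pyarrow's internal code; the helper name `pairs_to_dict` is an assumption) shows the check:

```python
def pairs_to_dict(pairs):
    """Convert (key, value) pairs to a dict, rejecting duplicate keys.

    A naive dict(pairs) would silently drop earlier values for a
    repeated key; raising instead surfaces the data loss.
    """
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate map key: {key!r}")
        result[key] = value
    return result

print(pairs_to_dict([('a', 1), ('b', 2)]))  # {'a': 1, 'b': 2}
# pairs_to_dict([('a', 1), ('a', 2)]) raises ValueError
```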
### Are there any user-facing changes?
No call sites should be broken; only a new kwarg is added.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]