jonded94 commented on PR #45471:
URL: https://github.com/apache/arrow/pull/45471#issuecomment-2648586525
> While this is not a bad idea in itself, it seems like the roundtripping
> concern could be solved more efficiently by making from_pylist accept a list of
> tuples for map fields.
Let me clarify what this is about. Map fields can already be created with
`from_pylist` from a list of tuples, as I show in the tests I added. Even the
code in my initial message shows this. Fundamentally, this is about adding
opt-in behaviour to `to_pylist` so that it returns what one would expect
from a Python perspective:
```
data = [{'x': {'a': 1}}]
# from_pylist works fine: the data will be properly encoded in the Arrow way
# of encoding Maps.
# to_pylist, however, will give lists of tuples instead of dicts.
pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
```
You can use `data = [{'x': [('a', 1)]}]` here too; this will yield the same
`RecordBatch`. That would of course *technically* qualify as a proper
"roundtrip", but this is not what this issue is about. It's about deserializing
Arrow Map types as the "expected" Python equivalent (a `dict`), at least as an
opt-in, which has already been supported for `pandas` for a couple of years now
(shown in the [linked GitHub issue](https://github.com/apache/arrow/issues/39010)).
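To make the difference concrete, here is a minimal pure-Python sketch of the post-processing one currently has to do by hand on the output of `to_pylist()`. The helper name `entries_to_dicts` is hypothetical and not part of pyarrow, and the sketch assumes that any non-empty list consisting solely of 2-tuples came from a Map column. It is not the implementation proposed in this PR.

```python
from typing import Any


def entries_to_dicts(value: Any) -> Any:
    """Recursively convert lists of (key, value) tuples into dicts.

    Hypothetical helper, not a pyarrow API. Assumption: a non-empty list
    made up only of 2-tuples represents Map entries. Empty lists are left
    as lists because their origin cannot be inferred without the schema.
    """
    if isinstance(value, dict):
        # Struct-like rows: convert each field value.
        return {key: entries_to_dicts(item) for key, item in value.items()}
    if isinstance(value, list):
        if value and all(isinstance(e, tuple) and len(e) == 2 for e in value):
            # Map entries: duplicate keys collapse, the last one wins.
            return {key: entries_to_dicts(item) for key, item in value}
        # Ordinary list: convert each element.
        return [entries_to_dicts(item) for item in value]
    return value


# Shape of the rows that to_pylist() gives for a map column 'x'.
rows = [{'x': [('a', 1)]}]
converted = [entries_to_dicts(row) for row in rows]
```

Doing this without the schema is ambiguous for empty values and lossy for duplicate keys, which is part of why an opt-in inside `to_pylist` itself is preferable.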
> Please note that from_pylist and to_pylist are quite costly in themselves.
Yes, but this is part of a very large distributed machine learning setup,
where relatively intricate filters are applied to deeply nested list/struct/map
columns. The compute for the actual machine learning exceeds the compute needed
to deserialize Python objects by many orders of magnitude.
For pure data queries, we would not use bare Python objects of course.