jonded94 commented on PR #45471:
URL: https://github.com/apache/arrow/pull/45471#issuecomment-2648586525

   > While this is not a bad idea in itself, it seems like the roundtripping 
concern could be solved more efficiently by making from_pylist accept a list of 
tuples for map fields.
   
   Let me clarify what this is about. Map fields can already be created with 
`from_pylist` from a list of tuples, as the tests I added show. Even the 
code in my initial message shows this. Fundamentally, this is about adding 
opt-in behaviour to `to_pylist` so that it returns what one would expect 
from a Python perspective:
   
   ```
   data = [{'x': {'a': 1}}]
   pa.RecordBatch.from_pylist(data, schema=schema).to_pylist()
   ^---------------------------------------------^
     this works fine, data will be properly encoded in the Arrow way of encoding Maps
                                                   ^---------^
                                                   this will give lists of tuples instead of dicts
   ```
   
   You can use `data = [{'x': [('a', 1)]}]` here too; this will yield the same 
`RecordBatch`. That would of course *technically* qualify as a proper 
"roundtrip", but it is not what this issue is about. It is about deserializing 
Arrow Map types as the "expected" Python equivalent (a `dict`), at least as an 
opt-in, which `pandas` conversion has already supported for a couple of years 
now (shown in the [linked GitHub issue](https://github.com/apache/arrow/issues/39010)).
   
   > Please note that from_pylist and to_pylist are quite costly in themselves. 
   
   Yes, but this is part of a very large distributed machine learning setup, 
where relatively intricate filters are applied to deeply nested list/struct/map 
columns. The compute spent on the actual machine learning exceeds the compute 
needed to deserialize Python objects by many orders of magnitude.
   
   For pure data queries, we would not use bare Python objects of course.


