Aryik commented on issue #44164:
URL: https://github.com/apache/arrow/issues/44164#issuecomment-2364308683
This isn't a reproducible example because it came from our production source
data, but here's some evidence for the difference in behavior between nested
and non-nested columns:
```
# additional_properties is a nested dict that contains strings
only_additional_properties = [
    {"additional_properties": item["additional_properties"]} for item in test_items
]
unnested_additional_properties = [item["additional_properties"] for item in test_items]

nested_table, nested_chunks = get_arrow_table_and_chunks(
    only_additional_properties, ["additional_properties"]
)
unnested_table, unnested_chunks = get_arrow_table_and_chunks(
    unnested_additional_properties, unnested_additional_properties[0].keys()
)

nested_table.nbytes
# -> 3668663888
unnested_table.nbytes
# -> 3668663888

nested_table["additional_properties"].combine_chunks()
# ArrowInvalid: offset overflow ...

# This is one of the columns inside "additional_properties"
unnested_table["primaryimagesrc"].nbytes
# -> 349,666,429

# This works
unnested_table["primaryimagesrc"].combine_chunks()
```
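One possible reading of the numbers above, offered as an assumption rather than a confirmed diagnosis: Arrow's default string type addresses its data with signed 32-bit offsets, so a single contiguous array can hold at most 2**31 - 1 bytes of string data. `combine_chunks()` has to produce one contiguous array, which would explain why the large nested column fails while one small unnested column succeeds. The helper name `fits_in_int32_offsets` is hypothetical, and the byte counts are the `nbytes` values quoted above, which also include offset and validity buffers and so only bound the string data from above.

```python
# Assumption: the string fields use 32-bit offsets (Arrow's default string
# type), so one contiguous array can address at most 2**31 - 1 data bytes.
INT32_MAX = 2**31 - 1


def fits_in_int32_offsets(data_bytes: int) -> bool:
    """Return True if this many bytes can be addressed with int32 offsets."""
    return data_bytes <= INT32_MAX


# nbytes values reported in the snippet above. They cover all buffers of the
# table or column, so they are upper bounds on the string data alone.
nested_table_nbytes = 3_668_663_888
primaryimagesrc_nbytes = 349_666_429

print(fits_in_int32_offsets(nested_table_nbytes))     # False
print(fits_in_int32_offsets(primaryimagesrc_nbytes))  # True
```

If that assumption holds, casting the string fields to `large_string` (64-bit offsets) before combining would be one thing to try; whether it avoids the error for this data is untested here.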
I will try to come up with a reproducible example.