Aryik commented on issue #44164:
URL: https://github.com/apache/arrow/issues/44164#issuecomment-2364308683

   This isn't a reproducible example because it came from our production source
   data, but here's some evidence that nested and non-nested columns behave
   differently:
   ```
   # additional_properties is a nested dict that contains strings
   only_additional_properties = [
       {"additional_properties": item["additional_properties"]}
       for item in test_items
   ]
   unnested_additional_properties = [
       item["additional_properties"] for item in test_items
   ]
   nested_table, nested_chunks = get_arrow_table_and_chunks(
       only_additional_properties, ["additional_properties"]
   )
   unnested_table, unnested_chunks = get_arrow_table_and_chunks(
       unnested_additional_properties, unnested_additional_properties[0].keys()
   )
   nested_table.nbytes
   # -> 3668663888
   unnested_table.nbytes
   # -> 3668663888
   nested_table["additional_properties"].combine_chunks()
   # ArrowInvalid: offset overflow ...
   
   # This is one of the columns inside "additional_properties"
   unnested_table["primaryimagesrc"].nbytes  
   # -> 349666429
   
   # This works 
   unnested_table["primaryimagesrc"].combine_chunks()
   ```
   
   I will try to come up with a reproducible example.
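
For context on the numbers above, here is a minimal sketch that compares the reported `nbytes` values with the largest offset a 32-bit signed integer can hold, which is the offset width Arrow's regular (non-large) string type uses. The helper name `fits_int32_offsets` is hypothetical. Treating `nbytes` as a stand-in for string data size is an assumption: `nbytes` also counts offset and validity buffers, and for the nested column it covers all child fields together. The comparison shows only which sizes exceed the limit; it does not establish the cause of the error.

```python
# Largest value representable in a signed 32-bit offset.
INT32_MAX = 2**31 - 1


def fits_int32_offsets(nbytes: int) -> bool:
    """Return True if `nbytes` of data could be addressed with int32 offsets.

    Hypothetical helper for illustration only; it is not part of pyarrow.
    """
    return nbytes <= INT32_MAX


# Sizes copied from the output above (assumed to approximate data size).
sizes = {
    "nested_table / unnested_table (total)": 3668663888,
    'unnested_table["primaryimagesrc"]': 349666429,
}

for name, nbytes in sizes.items():
    verdict = "fits" if fits_int32_offsets(nbytes) else "exceeds int32 offsets"
    print(f"{name}: {nbytes} bytes -> {verdict}")
```

The table total is above the limit while the single column is well below it, which at least matches the one column combining cleanly.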


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

