jorisvandenbossche commented on issue #44164:
URL: https://github.com/apache/arrow/issues/44164#issuecomment-2364166949

   
   
@Aryik if they are the same strings, then combining the chunks in the non-nested case should error as well.
   
Could you provide a reproducible example of that? (Ideally one that doesn't require 150 GB of memory to create, as the actual data only needs to be a bit larger than 2 GB to trigger the error.)
   
   Example showing an error in both cases:
   
   ```python
   import string
   import numpy as np
   
   import pyarrow as pa
   import pyarrow.compute as pc
   
   data = ["".join(np.random.choice(list(string.ascii_letters), n)) for n in 
np.random.randint(10, 500, size=10_000)]
   
   # creates a ChunkedArray because the data doesn't fit in a single array
   chunked_array = pa.array(data * 1000)
   # wrap the strings in a struct to reproduce the nested case
   nested_chunked_array = pc.make_struct(chunked_array, field_names=["string_field"])
   ```
   
   With this example (where the string data is > 2 GB), I get the error both for the plain string type and for strings nested in a struct:
   
   ```python
   In [24]: nested_chunked_array.nbytes
   Out[24]: 2596156000
   
   In [25]: nested_chunked_array.combine_chunks()
   ...
   ArrowInvalid: offset overflow while concatenating arrays
   
   In [26]: chunked_array.combine_chunks()
   ...
   ArrowInvalid: offset overflow while concatenating arrays, consider casting 
input from `string` to `large_string` first.
   
   ```

