jorisvandenbossche commented on issue #44164: URL: https://github.com/apache/arrow/issues/44164#issuecomment-2364166949
@Aryik if it is the same strings, then combining the chunks in the non-nested case should also error. Could you provide a reproducible example for that? (and ideally one that doesn't require 150GB of memory to create, as the actual data only needs to be a bit bigger than 2GB to trigger the error)

Example showing the error in both cases:

```python
import string

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

data = [
    "".join(np.random.choice(list(string.ascii_letters), n))
    for n in np.random.randint(10, 500, size=10_000)
]

# will create a chunked array because the data doesn't fit in a single array
chunked_array = pa.array(data * 1000)
nested_chunked_array = pc.make_struct(chunked_array, field_names=["string_field"])
```

With this example (where the string data > 2GB), I get the error both for the plain string type and for the string nested in a struct:

```python
In [24]: nested_chunked_array.nbytes
Out[24]: 2596156000

In [25]: nested_chunked_array.combine_chunks()
...
ArrowInvalid: offset overflow while concatenating arrays

In [26]: chunked_array.combine_chunks()
...
ArrowInvalid: offset overflow while concatenating arrays, consider casting input from `string` to `large_string` first.
```
