kangakum36 commented on issue #46633: URL: https://github.com/apache/arrow/issues/46633#issuecomment-2918122485
Per my original description and the comment in `table.cc`, `CombineChunks` produces multiple chunks when combining them would overflow the offset vector of a binary column. Here is an example:

```
import os

import pyarrow as pa

# Each value is about 1.5 GB, so the two together exceed what 32-bit offsets can address.
length = 1573741824
rand_bytes = os.urandom(length)
col_name = "col_1"

# Build a binary column with two chunks, each holding one large value.
table = pa.Table.from_arrays(
    [pa.chunked_array([[rand_bytes], [rand_bytes]], type=pa.binary())],
    names=[col_name],
)
print(f"Column {col_name} in original table has {table[col_name].num_chunks} chunks")

combined_table = table.combine_chunks()
print(f"Column {col_name} in combined table has {combined_table[col_name].num_chunks} chunks")
```

This prints:

```
Column col_1 in original table has 2 chunks
Column col_1 in combined table has 2 chunks
```
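For context, here is a rough sketch of the arithmetic behind this. It assumes the limit in play is the signed 32-bit offset range that `pa.binary()` uses; the helper name `fits_in_int32_offsets` is made up for illustration and is not an Arrow API.

```python
# Illustrative only: fits_in_int32_offsets is a hypothetical helper, not part of pyarrow.
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def fits_in_int32_offsets(value_sizes):
    """Return True if values of these byte sizes could share one binary chunk."""
    # The final offset of a chunk equals the total number of data bytes it holds.
    return sum(value_sizes) <= INT32_MAX


length = 1573741824  # same value size as in the example
print(fits_in_int32_offsets([length]))          # one value on its own
print(fits_in_int32_offsets([length, length]))  # both values in a single chunk
```

The two values together come to 3147483648 bytes, which is more than 2147483647, so they cannot share one chunk with 32-bit offsets. `pa.large_binary()` uses 64-bit offsets, so this particular limit would not apply to it.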