kangakum36 commented on issue #46633:
URL: https://github.com/apache/arrow/issues/46633#issuecomment-2918122485

   Per my original description and the comment in `table.cc`, `CombineChunks` 
leaves a column in multiple chunks when merging them into one would overflow 
the offset buffer of a binary column (its offsets are 32-bit).
   
   Here is an example:
   ```
   import os

   import pyarrow as pa

   # Each value is about 1.47 GiB; two of them exceed what int32 offsets can address.
   length = 1573741824
   rand_bytes = os.urandom(length)
   col_name = "col_1"
   table = pa.Table.from_arrays(
       [pa.chunked_array([[rand_bytes], [rand_bytes]], type=pa.binary())],
       names=[col_name],
   )
   print(f"Column {col_name} in original table has {table[col_name].num_chunks} chunks")

   combined_table = table.combine_chunks()

   print(f"Column {col_name} in combined table has {combined_table[col_name].num_chunks} chunks")
   ```
   This prints:
   ```
   Column col_1 in original table has 2 chunks
   Column col_1 in combined table has 2 chunks
   ```
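
The arithmetic behind this can be sketched as follows. This is only a rough check, assuming that `pa.binary()` stores offsets as signed 32-bit integers, so the value data of one array cannot exceed `2**31 - 1` bytes. The helper name `fits_in_one_binary_chunk` is hypothetical and not part of pyarrow.

```python
# Assumption: pa.binary() uses signed 32-bit offsets, so the value data of a
# single array is limited to 2**31 - 1 bytes.
INT32_MAX = 2**31 - 1

length = 1573741824   # bytes per value, as in the example above
total = 2 * length    # bytes needed if both chunks were merged into one


def fits_in_one_binary_chunk(total_bytes: int) -> bool:
    """Hypothetical helper: True if the data fits under the int32 offset limit."""
    return total_bytes <= INT32_MAX


print(total)                             # 3147483648
print(fits_in_one_binary_chunk(length))  # True
print(fits_in_one_binary_chunk(total))   # False
```

Each value fits in a chunk on its own, but the two together do not, which matches the two chunks reported after `combine_chunks()`. `pa.large_binary()` uses 64-bit offsets and so is not subject to this limit, though I have not tried casting this exact data.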


