westonpace commented on issue #32439:
URL: https://github.com/apache/arrow/issues/32439#issuecomment-1588196555

   The issue is going to happen anytime a single string column ends up with 
more than 2^31 characters.  So, in OP's reproduction the column `square` has 161 
characters per string and 800,000 * 24 strings, which is `3,091,200,000` 
characters.  2^31 is `2,147,483,648`.  At this point we have to split the 
resulting array into chunks (or use the large_string data type, but that has 
issues of its own).
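
   As a quick sanity check of the arithmetic above, the numbers can be 
reproduced in a few lines.  This is only a sketch: the variable names are made 
up for illustration, and it counts characters as the comment does (for ASCII 
data one character is one byte).

```python
# Sizes taken from the reproduction discussed above.
chars_per_string = 161
num_strings = 800_000 * 24

total_chars = chars_per_string * num_strings
limit = 2**31  # size that 32-bit offsets cannot go beyond

print(f"{total_chars:,}")   # 3,091,200,000
print(f"{limit:,}")         # 2,147,483,648
print(total_chars > limit)  # True
```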
   
   This "breaking unexpectedly large columns into chunks" behavior is rather 
tricky and it appears we are doing something wrong when working with lists of 
struct arrays.  Here's a compact reproducer (that only has 3 rows):
   
   ```
   import pyarrow as pa
   import pandas as pd
   
   # Each string is 10^9 characters, so three of them exceed 2^31 in total.
   x = "0" * 1000000000
   # Case 1: plain string column.
   df = pd.DataFrame({"strings": [x, x, x]})
   tab = pa.Table.from_pandas(df)
   print(tab.column(0).num_chunks)
   
   # Case 2: struct column with one string field.
   struct = {"struct_field": x}
   df = pd.DataFrame({"structs": [struct, struct, struct]})
   tab = pa.Table.from_pandas(df)
   print(tab.column(0).num_chunks)
   
   # Case 3: list-of-strings column.
   lists = [x]
   df = pd.DataFrame({"lists": [lists, lists, lists]})
   tab = pa.Table.from_pandas(df)
   print(tab.column(0).num_chunks)
   
   # Case 4: list-of-structs column (the case that appears to go wrong).
   los = [struct]
   df = pd.DataFrame({"los": [los, los, los]})
   tab = pa.Table.from_pandas(df)
   print(tab.column(0).num_chunks)
   ```
   
   It seems the struct array has length 3.  Meanwhile, its child, the string 
array, has length 2 because it had to be broken into 2 chunks: the first 
chunk has the first 2 values and the second chunk has the third.
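
   To make the 2 + 1 split concrete, here is a rough pure-Python sketch of 
greedy chunking by size.  It is not how pyarrow implements this: 
`split_into_chunks` is a hypothetical helper, and the capacity of `2**31 - 1` 
is an assumption based on 32-bit offsets (the exact cutoff inside Arrow may 
differ slightly).

```python
INT32_MAX = 2**31 - 1  # assumed per-chunk capacity with 32-bit offsets


def split_into_chunks(lengths, capacity=INT32_MAX):
    """Greedily group value lengths into chunks that fit within capacity.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    chunks, current, used = [], [], 0
    for n in lengths:
        if n > capacity:
            raise ValueError("a single value exceeds the chunk capacity")
        # Start a new chunk when the next value would overflow this one.
        if current and used + n > capacity:
            chunks.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        chunks.append(current)
    return chunks


# Three strings of 10**9 characters each, as in the reproducer.
child_chunks = split_into_chunks([10**9] * 3)
chunk_lengths = [len(c) for c in child_chunks]
print(chunk_lengths)  # [2, 1]

# A parent that keeps all 3 rows does not line up with the first child chunk.
parent_length = 3
print(parent_length == chunk_lengths[0])  # False
```

   Under these assumptions a parent array of length 3 would have to be split 
the same way as its child (2 rows, then 1) for the lengths to agree.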
   
   So if someone wanted to investigate this, I would recommend starting by 
looking at the conversion-from-pandas code and seeing how the struct and 
list arrays handle the case where their children are converted into 
multiple chunks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
