JazJaz426 opened a new issue, #44944:
URL: https://github.com/apache/arrow/issues/44944
### Describe the usage question you have. Please include as many useful
details as possible.
Hi folks,
I read online that in PyArrow a `string` column has a 2 GB size limit at the column level. However, in my work I noticed this doesn't seem to hold.
```python
import polars as pl
import pyarrow as pa

def some_function(
    self, raw_table: pa.Table,
):
    schema = raw_table.schema
    df = pl.DataFrame(raw_table)
```
In the code above, the table `raw_table` has a `document` column whose size is over 2 GB; I used `sum(buf.size if buf is not None else 0 for buf in arrow_array.buffers())` to check the size. But when I inspect the schema, it says that column is `string` type.
I later converted it to Polars and then back to Arrow, which automatically turns all `string` columns into `large_string` due to Polars' default behavior. I then tried to cast the `document` column back to `string`, but it raised a casting error, and it always stops at the same fixed number of rows. I calculated the size processed up to that point and it's roughly 2 GB.
So considering both cases, I'm fairly sure the column is definitely over 2 GB and there was no calculation error in the first place. But I'm very curious: why is it showing as `string` type in the first place? Is there something subtle I'm not aware of?
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]