JazJaz426 opened a new issue, #44944:
URL: https://github.com/apache/arrow/issues/44944
### Describe the usage question you have. Please include as many useful
details as possible.
Hi folks,
I read online that in PyArrow a `string` column has a 2 GB size limit at the column level. However, in my work I noticed this doesn't seem to hold.
```python
import polars as pl
import pyarrow as pa

def some_function(
    self, raw_table: pa.Table,
):
    schema = raw_table.schema
    df = pl.DataFrame(raw_table)
```
In the code above, the table `raw_table` has a `document` column whose size is over 2 GB; I used `sum(buf.size if buf is not None else 0 for buf in arrow_array.buffers())` to check the size. But when I inspect the schema, it says that column is `string` type.
I later converted it to Polars and then back to Arrow, which automatically turns all `string` columns into `large_string` due to Polars' default behavior. I then tried to cast the `document` column back to `string`, but it raised a casting error, and it always stops at the same fixed number of rows. I calculated the size processed up to that point and it's roughly 2 GB.
So considering both cases, I'm fairly sure the column is definitely over 2 GB and there was no calculation error in the first place. But I'm very curious: why is it showing as `string` type in the first place? Is there something subtle I'm not aware of?
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]