alamb commented on issue #7973: URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3124301611
> [@alamb](https://github.com/alamb) setting DataType::LargeUtf8 (which is large_string() in pyarrow) does fix it, but this is still an arrow-rs bug AFAIS, since both pyarrow and duckdb can read the file just fine without changing the schema. I assume they do some internal splitting into smaller batches to avoid integer overflows.

Yes, I agree this is a bug in arrow-rs (more precisely, I think it is a bug in the parquet reader, which is part of this repo).

What should happen is that when reading data into a Utf8 column, if `batch_size` records can't be read into the target StringArray without overflowing the offsets (i.e. more than 2GB of string data), fewer than `batch_size` records should be read.
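
In the meantime, a minimal sketch of one possible workaround (not the fix described above): explicitly capping the batch size via `ParquetRecordBatchReaderBuilder::with_batch_size` so that no single Utf8 array has to hold anywhere near 2GB of string data. The file name `data.parquet` and the batch size of `1024` are placeholders, and whether a given cap is small enough depends on how long the strings in the file are:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path -- substitute the actual parquet file.
    let file = File::open("data.parquet")?;

    // `with_batch_size` caps the number of rows per RecordBatch. Today the
    // reader honours this row count even if the resulting Utf8 (i32-offset)
    // StringArray would exceed 2GB, so the cap has to be chosen small enough
    // for the data at hand.
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(1024)
        .build()?;

    for batch in reader {
        let batch = batch?;
        println!("read {} rows", batch.num_rows());
    }
    Ok(())
}
```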