alamb commented on issue #7973:
URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3124301611

   > [@alamb](https://github.com/alamb) setting DataType::LargeUtf8 (which is
   > large_string() in pyarrow) does fix it, but this is still an arrow-rs bug
   > AFAIS, since both pyarrow and duckdb can read the file just fine without
   > changing the schema. I assume they do some internal splitting into smaller
   > batches to avoid integer overflows.
   
   Yes, I agree this is a bug in arrow-rs (actually I think it is a bug in the
parquet reader, which is part of this repo).
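   
   In the meantime, I believe the workaround quoted above (reading the column as
LargeUtf8) can also be applied from the Rust side by giving the reader a schema
hint. A minimal sketch, assuming a single-column file named `strings.parquet`
with a string column called `value` (both names are placeholders), and assuming
`ArrowReaderOptions::with_schema` accepts a LargeUtf8 override for this column:
   
   ```rust
   use std::fs::File;
   use std::sync::Arc;
   
   use arrow_schema::{DataType, Field, Schema};
   use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       let file = File::open("strings.parquet")?;
   
       // Schema hint: decode the string column with i64 offsets (LargeUtf8)
       // instead of the inferred Utf8 (i32 offsets). The hint must cover all
       // columns in the file; "value" is a placeholder column name.
       let hint = Arc::new(Schema::new(vec![Field::new(
           "value",
           DataType::LargeUtf8,
           true,
       )]));
   
       let options = ArrowReaderOptions::new().with_schema(hint);
       let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?
           .build()?;
   
       for batch in reader {
           println!("read {} rows", batch?.num_rows());
       }
       Ok(())
   }
   ```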
   
   What should happen is that when reading data into a Utf8 column, if
`batch_size` records can't be read into the target StringArray without
overflowing its i32 offsets (i.e. more than 2 GiB of string data), then fewer
than `batch_size` records should be read.
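   
   Until that is fixed, a caller-side mitigation along the same lines is to
lower `batch_size` so that each batch's concatenated string data stays under the
i32 offset limit. A minimal sketch (the file name and the value 64 are
placeholders to be tuned to the actual string sizes):
   
   ```rust
   use std::fs::File;
   
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       let file = File::open("strings.parquet")?;
   
       // Read fewer rows per batch so the total string data collected into one
       // StringArray stays below i32::MAX bytes. 64 here is a placeholder to be
       // tuned to the expected string lengths.
       let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
           .with_batch_size(64)
           .build()?;
   
       for batch in reader {
           println!("read {} rows", batch?.num_rows());
       }
       Ok(())
   }
   ```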
   
   

