nicki-dese opened a new issue, #39682: URL: https://github.com/apache/arrow/issues/39682
### Describe the bug, including details regarding any error messages, version, and platform.

read_parquet() is giving the following error with large parquet files:

> Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180

Versions etc. from sessionInfo:

- arrow 14.0.0.2
- R version 4.3.0 (2023-04-21 ucrt)
- Platform: x86_64-w64-mingw32/x64
- Windows 11 x64 (build 22621)

Descriptive info on an example problematic table, with two columns:

- 140 million rows.
- id: large_string, 4.2 Gb
- state: int32, 0.5 Gb

The id is a hashed string, 24 characters long. It is not practical to change it, as it's the joining key.

Note: the data above is stored as a data.table in R and left that way when saving it with write_parquet(). I converted it to an arrow table only to produce the descriptive stats above, because I thought they'd be more useful to you!

Other relevant information:

- The large parquet files were created with arrow::write_parquet().
- The same files previously opened with an earlier version of read_parquet(). Unfortunately I'm not sure which version, but it was working in late November/early December. We work in a closed environment and use Posit Package Manager, and our VMs rebuild every 30 days, so it would have been a fairly recent version.
- I've reproduced the error, and it still occurs with newly created large parquet files, such as the one described above.
- Loading the same files with open_dataset() works. However, our team uses targets, which implicitly calls read_parquet(), so this bug has unfortunately affected many of our workflows.

Note: I haven't been able to roll back to an earlier version of arrow. We only have earlier source versions, not binaries, and because I'm on Windows I get libarrow errors when building them.

If there is a workaround for this, please let me know.

### Component(s)

Parquet, R
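As a rough sanity check on the sizes reported above, the arithmetic can be sketched in Python. This is not part of the original report. It assumes exactly 140,000,000 rows and that every id is 24 single-byte characters. It also assumes, without confirmation from the report, that the error comes from the id column being materialised as a single regular string array, whose 32-bit offsets cap it at about 2 GiB of character data (large_string uses 64-bit offsets and has no such cap).

```python
# Back-of-the-envelope sizes for the example table in the report.
# Assumptions (hypothetical, not stated exactly in the report):
#   - exactly 140,000,000 rows
#   - every id is 24 single-byte (ASCII) characters
ROWS = 140_000_000
ID_BYTES = 24

# Byte limit quoted in the error message; equal to 2**31 - 2.
LIMIT = 2_147_483_646

GIB = 2**30

id_data_bytes = ROWS * ID_BYTES          # raw character data of the id column
large_offsets_bytes = (ROWS + 1) * 8     # 64-bit offsets of a large_string array
id_total_bytes = id_data_bytes + large_offsets_bytes
state_bytes = ROWS * 4                   # 32-bit integers

print(f"id character data : {id_data_bytes / GIB:.2f} GiB")
print(f"id as large_string: {id_total_bytes / GIB:.2f} GiB")
print(f"state as int32    : {state_bytes / GIB:.2f} GiB")

# The character data alone is larger than the limit in the error message,
# so it cannot fit in one array that is capped at LIMIT bytes.
exceeds_limit = id_data_bytes > LIMIT
print(f"id data exceeds limit: {exceeds_limit}")

# Largest number of 24-byte ids that would fit under the limit in one array.
max_rows_per_array = LIMIT // ID_BYTES
print(f"max rows per capped array: {max_rows_per_array:,}")
```

Under these assumptions the computed sizes round to the 4.2 Gb and 0.5 Gb figures in the report, and the id character data is well above the 2147483646-byte limit. That is consistent with the column only fitting if it is kept as large_string or split into several chunks.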