nicki-dese opened a new issue, #39682:
URL: https://github.com/apache/arrow/issues/39682

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   read_parquet() is giving the following error with large parquet files:
   
   > Capacity error: array cannot contain more than 2147483646 bytes, have 2147489180
   
   
   Versions etc. from sessionInfo():
   - arrow 14.0.0.2
   - R version 4.3.0 (2023-04-21 ucrt)
   - Platform: x86_64-w64-mingw32/x64
   - Windows 11 x64 (build 22621)
   
   Descriptive info on example problematic table, with two columns:
   - 140 million rows. 
   - id: large_string, 4.2 GB
   - state: int32, 0.5 GB
   
   The id is a hashed string, 24 characters long. It is not practical to change 
it, as it's the joining key. 
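
As a rough sanity check on the sizes above, the sketch below compares the raw character data in the id column with the limit quoted in the error message. It assumes exactly 140 million rows and one byte per character (both approximations), and the variable names are illustrative only.

```r
# Figures taken from the description above; the row count is approximate.
n_rows <- 140e6          # ~140 million rows
bytes_per_id <- 24       # 24-character hashed id, assuming single-byte characters

# Limit quoted in the error message.
limit_bytes <- 2147483646

# Character data only; offsets and validity bitmaps are not counted.
id_data_bytes <- n_rows * bytes_per_id
exceeds_limit <- id_data_bytes > limit_bytes

cat("id character data:", format(id_data_bytes, big.mark = ","), "bytes\n")
cat("exceeds limit:", exceeds_limit, "\n")
```

Under these assumptions the id column alone is well past the limit, so it cannot fit in a single array that is subject to that cap.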
   
   Note, the data above is stored as a data.table in R and left that way when 
saving it with write_parquet(). But I've converted it to an arrow table for the 
above descriptive stats, because I thought they'd be more useful to you!
   
   
   Other relevant information:
   - The large parquet files were created with arrow::write_parquet()
   - The same files previously opened with an earlier version of read_parquet().
   Unfortunately I'm not sure which version, but it was working in late 
November/early December. We work in a closed environment and use Posit Package 
Manager, and our VMs rebuild every 30 days, so it would have been a fairly 
recent version.
   - I've reproduced the error, and it still occurs with newly created large 
parquet files, such as the one described above.
   - Loading the same files with open_dataset() works. However, our team uses 
targets, which implicitly calls read_parquet(), so this bug has unfortunately 
affected many of our workflows. 
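
For reference, the two loading routes mentioned above can be sketched on a small stand-in file as follows. The temporary file and the variable names are hypothetical, and this only illustrates the calls; whether collecting the result into an R data frame avoids the capacity error on the full-size files has not been checked here.

```r
library(arrow)

# Small stand-in file; the real files are far larger (this path is hypothetical).
path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(id = c("a", "b", "c"), state = 1:3), path)

# Route 1: open lazily as a Dataset, then materialise only when needed.
ds <- open_dataset(path)
via_dataset <- dplyr::collect(ds)

# Route 2: read the file but keep it as an Arrow Table rather than
# converting it to an R data frame.
tbl <- read_parquet(path, as_data_frame = FALSE)
```

Neither route changes what targets does internally, since it calls read_parquet() itself.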
   
   Note: I haven't been able to roll back to an earlier version of arrow. 
We only have earlier source versions, not binaries, and because I'm on 
Windows I get libarrow errors when building them. If there is a workaround 
for this, please let me know. 
   
   
   
   ### Component(s)
   
   Parquet, R

