joeramirez opened a new issue, #48057:
URL: https://github.com/apache/arrow/issues/48057
### Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I have a single parquet file that is ~130 MB (1500 rows x 18000 columns). I
want to use this file as a database where the first column is a Date series and
the rest of the columns are stock prices, with the column names being the stock
tickers.
I'm using arrow::read_parquet like so:
```r
res <- arrow::read_parquet(
  file = pqpath, col_select = c("Date", dplyr::any_of(tics)))
```
That takes about 12 seconds. For this test, `pqpath` points to a local parquet
file.
Now if I use nanoparquet instead, like this:
```r
res <- nanoparquet::read_parquet(
  file = pqpath, col_select = c("Date", tics))
```
it completes nearly instantly with the correct data. Why is nanoparquet so
much faster (or arrow so slow)?
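For what it's worth, this is roughly how I measured the difference (a minimal sketch; the tickers below are placeholders, not my actual `tics` vector):

```r
# Minimal timing sketch. `tics` here is a small placeholder vector of
# ticker column names; in my real code it is a longer character vector.
tics <- c("AAPL", "MSFT", "GOOG")

system.time(
  res_arrow <- arrow::read_parquet(
    file = pqpath, col_select = c("Date", dplyr::any_of(tics)))
)
#> ~12 seconds elapsed on my machine

system.time(
  res_nano <- nanoparquet::read_parquet(
    file = pqpath, col_select = c("Date", tics))
)
#> well under a second, same data returned
```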
The parquet file was created using arrow::write_parquet() with no special
options.
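In case it helps with reproduction, a file of the same shape can be built along these lines (a sketch with random data, not my actual prices; the column names are made-up tickers):

```r
library(arrow)

n_rows <- 1500
n_cols <- 18000  # number of ticker columns

# One Date column plus ~18,000 numeric price columns named after tickers.
prices <- as.data.frame(
  matrix(rnorm(n_rows * n_cols, mean = 100, sd = 10), nrow = n_rows))
names(prices) <- sprintf("TICK%05d", seq_len(n_cols))
df <- cbind(
  Date = seq(as.Date("2020-01-01"), by = "day", length.out = n_rows),
  prices)

arrow::write_parquet(df, "prices.parquet")  # defaults, no special options
```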
I would switch over to nanoparquet, but I need to read files from AWS S3, so
arrow is preferred in that regard.
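For context on the S3 point, this is the kind of usage that makes arrow attractive for me (the bucket and object names below are placeholders):

```r
library(arrow)

# Placeholder bucket/key; my real file lives in a private bucket.
bucket <- arrow::s3_bucket("my-bucket", region = "us-east-1")
res <- arrow::read_parquet(
  bucket$path("prices.parquet"),
  col_select = c("Date", dplyr::any_of(tics)))

# An s3:// URI works as well:
# res <- arrow::read_parquet("s3://my-bucket/prices.parquet")
```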
I'm using arrow version 21.0.0.1 and nanoparquet version 0.4.2.
Thanks for any insight.
### Component(s)
R