joeramirez opened a new issue, #48057:
URL: https://github.com/apache/arrow/issues/48057
### Describe the bug, including details regarding any error messages, version, and platform.
Hi,
I have a single parquet file that is ~130 MB (1500 rows x 18000 columns). I
want to use this file as a database where the first column is a Date series and
the rest of the columns are stock prices, with the column names being the stock
tickers.
I'm using arrow::read_parquet like so:
```r
res <- arrow::read_parquet(
  file = pqpath, col_select = c("Date", dplyr::any_of(tics)))
```
That takes about 12 seconds. For this test, `pqpath` points to a local parquet
file.
Now if I use nanoparquet instead, like this:
```r
res <- nanoparquet::read_parquet(
  file = pqpath, col_select = c("Date", tics))
```
it completes nearly instantly with the correct data. Why is nanoparquet so
much faster (or arrow so slow)?
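For what it's worth, this is roughly how I measured the difference (a minimal sketch; the tickers below are placeholders, not my actual `tics` vector):

```r
# Minimal timing sketch. `tics` here is a small placeholder vector of
# ticker column names; in my real code it is a longer character vector.
tics <- c("AAPL", "MSFT", "GOOG")

system.time(
  res_arrow <- arrow::read_parquet(
    file = pqpath, col_select = c("Date", dplyr::any_of(tics)))
)
#> ~12 seconds elapsed on my machine

system.time(
  res_nano <- nanoparquet::read_parquet(
    file = pqpath, col_select = c("Date", tics))
)
#> well under a second, same data returned
```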
The parquet file was created using arrow::write_parquet() with no special
options.
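In case it helps with reproduction, a file of the same shape can be built along these lines (a sketch with random data, not my actual prices; the column names are made-up tickers):

```r
library(arrow)

n_rows <- 1500
n_cols <- 18000  # number of ticker columns

# One Date column plus ~18,000 numeric price columns named after tickers.
prices <- as.data.frame(
  matrix(rnorm(n_rows * n_cols, mean = 100, sd = 10), nrow = n_rows))
names(prices) <- sprintf("TICK%05d", seq_len(n_cols))
df <- cbind(
  Date = seq(as.Date("2020-01-01"), by = "day", length.out = n_rows),
  prices)

arrow::write_parquet(df, "prices.parquet")  # defaults, no special options
```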
I would switch over to nanoparquet, but I need to read files from AWS S3, so
arrow is preferred in that regard.
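For context on the S3 point, this is the kind of usage that makes arrow attractive for me (the bucket and object names below are placeholders):

```r
library(arrow)

# Placeholder bucket/key; my real file lives in a private bucket.
bucket <- arrow::s3_bucket("my-bucket", region = "us-east-1")
res <- arrow::read_parquet(
  bucket$path("prices.parquet"),
  col_select = c("Date", dplyr::any_of(tics)))

# An s3:// URI works as well:
# res <- arrow::read_parquet("s3://my-bucket/prices.parquet")
```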
I'm using arrow version 21.0.0.1 and nanoparquet version 0.4.2.
Thanks for any insight.
### Component(s)
R