romainfrancois opened a new pull request #8122: URL: https://github.com/apache/arrow/pull/8122
I don't think this is about `shared_ptr_is_null()` as indicated in the jira issue: https://issues.apache.org/jira/browse/ARROW-9557

I guess profvis (or probably the underlying profiler) struggles with that case.

What happens though is that `$ReadTable()` first calls `$GetSchema()`:

```r
ReadTable = function(col_select = NULL) {
  col_select <- enquo(col_select)
  if (quo_is_null(col_select)) {
    shared_ptr(Table, parquet___arrow___FileReader__ReadTable1(self))
  } else {
    all_vars <- shared_ptr(Schema, parquet___arrow___FileReader__GetSchema(self))$names
    indices <- match(vars_select(all_vars, !!col_select), all_vars) - 1L
    shared_ptr(Table, parquet___arrow___FileReader__ReadTable2(self, indices))
  }
}
```

and that's expensive for some reason:

``` r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(vctrs)
#>
#> Attaching package: 'vctrs'
#> The following objects are masked from 'package:arrow':
#>
#>     field, list_of
library(purrr)

df <- new_data_frame(
  map(set_names(1:4000), ~rnorm(50000))
)
tf <- tempfile()
write_parquet(df, tf)

reader <- ParquetFileReader$create(tf)

parquet___arrow___FileReader__GetSchema <- arrow:::parquet___arrow___FileReader__GetSchema
parquet___arrow___FileReader__ReadColumn <- arrow:::parquet___arrow___FileReader__ReadColumn

system.time({
  for (i in 1:4000) {
    parquet___arrow___FileReader__GetSchema(reader)
  }
})
#>    user  system elapsed
#>  43.809   1.744  47.962

system.time({
  for (i in 1:4000) {
    parquet___arrow___FileReader__ReadColumn(reader, i)
  }
})
#>    user  system elapsed
#>   3.035   2.448  10.606
```

<sup>Created on 2020-09-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>

So we probably need a more complete R6 wrapper around `parquet::arrow::FileReader`.
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L107

As a start, here is `$GetColumn()`
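
For illustration only, here is a minimal base-R sketch of why a per-call schema lookup adds up when many columns are resolved one by one. It does not use arrow at all: `get_schema_names()` is a hypothetical stand-in for the expensive `GetSchema()` call, and `indices_uncached()` / `indices_cached()` are made-up helper names, not part of this PR. It only counts how often the lookup runs; it makes no claim about actual timings.

```r
# Hypothetical stand-in for an expensive schema lookup (not an arrow function).
# It counts how many times it has been called.
schema_calls <- 0L
get_schema_names <- function(n_cols) {
  schema_calls <<- schema_calls + 1L
  as.character(seq_len(n_cols))
}

# Resolve selected names to zero-based indices, fetching the schema every time.
indices_uncached <- function(selected, n_cols) {
  vapply(selected, function(nm) {
    match(nm, get_schema_names(n_cols)) - 1L
  }, integer(1), USE.NAMES = FALSE)
}

# Same result, but the schema names are fetched once and reused.
indices_cached <- function(selected, n_cols) {
  all_vars <- get_schema_names(n_cols)
  match(selected, all_vars) - 1L
}

selected <- c("3", "10", "42")

schema_calls <- 0L
uncached <- indices_uncached(selected, 4000L)
calls_uncached <- schema_calls

schema_calls <- 0L
cached <- indices_cached(selected, 4000L)
calls_cached <- schema_calls
```

Both helpers return the same indices; the difference is one schema lookup per selected column versus a single lookup in total, which is the kind of saving a fuller R6 wrapper could make possible.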