romainfrancois opened a new pull request #8122:
URL: https://github.com/apache/arrow/pull/8122


   I don't think this is about `shared_ptr_is_null()`, as indicated in the Jira issue (https://issues.apache.org/jira/browse/ARROW-9557). I guess profvis (or probably the underlying profiler) struggles with that case.
   
   What happens, though, is that `$ReadTable()` first calls `$GetSchema()` whenever `col_select` is supplied:
   
   ```r
   ReadTable = function(col_select = NULL) {
     col_select <- enquo(col_select)
     if (quo_is_null(col_select)) {
       shared_ptr(Table, parquet___arrow___FileReader__ReadTable1(self))
     } else {
       all_vars <- shared_ptr(Schema, parquet___arrow___FileReader__GetSchema(self))$names
       indices <- match(vars_select(all_vars, !!col_select), all_vars) - 1L
       shared_ptr(Table, parquet___arrow___FileReader__ReadTable2(self, indices))
     }
   }
   ```
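
   To make the cost pattern concrete, here is a minimal sketch in plain R that does not use arrow. `get_schema_names()` and `read_selected()` are hypothetical stand-ins for `$GetSchema()` and the `else` branch of `$ReadTable()`. The sketch only counts how often the schema lookup runs when columns are selected one at a time:

   ```r
   # Hypothetical stand-ins, not arrow functions.
   schema_calls <- 0L

   # Pretend schema lookup: counts its own invocations.
   get_schema_names <- function() {
     schema_calls <<- schema_calls + 1L
     paste0("col", 1:5)
   }

   # Mirrors the else branch above: look up the schema, then
   # turn the selected name into a 0-based column index.
   read_selected <- function(name) {
     all_vars <- get_schema_names()
     match(name, all_vars) - 1L
   }

   # Selecting columns one at a time triggers one schema lookup per column.
   indices <- vapply(paste0("col", 1:5), read_selected, integer(1))
   schema_calls
   ```

   With N columns read one by one, the schema lookup runs N times, so any per-call cost in `$GetSchema()` is multiplied by the number of columns.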
   
   and that's expensive for some reason: 
   
   ``` r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   library(vctrs)
   #> 
   #> Attaching package: 'vctrs'
   #> The following objects are masked from 'package:arrow':
   #> 
   #>     field, list_of
   library(purrr)
   
   df <- new_data_frame(
     map(set_names(1:4000), ~rnorm(50000))
   )
   tf <- tempfile()
   write_parquet(df, tf)
   
   reader <- ParquetFileReader$create(tf)
   
   parquet___arrow___FileReader__GetSchema <- arrow:::parquet___arrow___FileReader__GetSchema
   parquet___arrow___FileReader__ReadColumn <- arrow:::parquet___arrow___FileReader__ReadColumn
   
   system.time({
     for (i in 1:4000) {
       parquet___arrow___FileReader__GetSchema(reader)
     }
   })
   #>    user  system elapsed 
   #>  43.809   1.744  47.962
   
   system.time({
     for (i in 1:4000) {
       parquet___arrow___FileReader__ReadColumn(reader, i)
     }
   })
   #>    user  system elapsed 
   #>   3.035   2.448  10.606
   ```
   
   <sup>Created on 2020-09-07 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001)</sup>
   
   So we probably need a more complete R6 wrapper around `parquet::arrow::FileReader`: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L107
   
   As a start, here is `$GetColumn()`
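
   A sketch of how such a wrapper could be used to iterate over columns while looking up the schema only once. It assumes the installed arrow version exposes the method as `$ReadColumn()` (matching the `parquet___arrow___FileReader__ReadColumn` binding used in the benchmark above), and that it takes a 0-based column index and returns a `ChunkedArray`. The small data frame is illustrative only:

   ```r
   library(arrow)

   # Small stand-in file; the benchmark above used 4000 columns.
   df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
   tf <- tempfile(fileext = ".parquet")
   write_parquet(df, tf)

   reader <- ParquetFileReader$create(tf)

   # Look the schema up once, outside the loop.
   col_names <- reader$GetSchema()$names

   # Assumption: $ReadColumn() takes a 0-based index,
   # hence the `- 1L` when converting from R's 1-based positions.
   columns <- lapply(seq_along(col_names) - 1L, function(i) reader$ReadColumn(i))
   names(columns) <- col_names
   ```

   Compared with selecting each column through `$ReadTable()`, this keeps the schema lookup out of the per-column path.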

