nbc commented on issue #41057:
URL: https://github.com/apache/arrow/issues/41057#issuecomment-2041507202

   To reproduce this problem, you can use the code below.
   
   ```
   library(arrow)
   library(dplyr)
   
   
   # mode = "wb" keeps the binary Parquet file intact on Windows
   download.file("https://static.data.gouv.fr/resources/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/20231214-130548/stocketablissement-utf8.parquet",
                 "stocketablissement-utf8.parquet", mode = "wb")
   
   # create a partitioned dataset with default min_rows_per_group
   open_dataset("stocketablissement-utf8.parquet") |>
     mutate(code = str_sub(activitePrincipaleEtablissement, 1L, 2L)) |>
     write_dataset("stock_ds")
   
   # create a partitioned dataset with min_rows_per_group at 32000L
   open_dataset("stocketablissement-utf8.parquet") |>
     mutate(code = str_sub(activitePrincipaleEtablissement, 1L, 2L)) |>
     write_dataset("stock_ds_32000", min_rows_per_group = 32000L)
   ```
   
   I measured the duration and memory usage of this query against each source:
   
   ```
   open_dataset(input) |>
     filter(dateCreationEtablissement < "2018-02-07" & dateDebut > "2020-02-02") |>
     count() |>
     collect()
   ```
   
   The results are:
   
   ```
     source                          duration max_mem
   1 stocketablissement-utf8.parquet     3.14    2.17
   2 stock_ds                           15.9     6.54
   3 stock_ds_32000                      3.38    2.83
   ```
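
   For anyone who wants to repeat the comparison, here is a minimal sketch of one way to collect the timings. It is an assumption about the method, not the harness used for the table above: `time_query` is a hypothetical helper, it uses base R's `system.time()`, and it records only elapsed time, not memory.

   ```r
   library(arrow)
   library(dplyr)

   # Hypothetical helper: run the query against one source and
   # return the elapsed wall-clock time in seconds.
   time_query <- function(input) {
     elapsed <- system.time(
       open_dataset(input) |>
         filter(dateCreationEtablissement < "2018-02-07" & dateDebut > "2020-02-02") |>
         count() |>
         collect()
     )[["elapsed"]]
     data.frame(source = input, duration = elapsed)
   }

   # The three sources created by the reproduction script above
   sources <- c("stocketablissement-utf8.parquet", "stock_ds", "stock_ds_32000")
   do.call(rbind, lapply(sources, time_query))
   ```

   Running each query in a fresh R session should keep file-system and Arrow caches from one run from flattering the next.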
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
