nbc commented on issue #41057:
URL: https://github.com/apache/arrow/issues/41057#issuecomment-2041507202
To reproduce this problem, you can use the code below:
```
library(arrow)
library(dplyr)
download.file(
  "https://static.data.gouv.fr/resources/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/20231214-130548/stocketablissement-utf8.parquet",
  "stocketablissement-utf8.parquet"
)

# create a partitioned dataset with default min_rows_per_group
open_dataset("stocketablissement-utf8.parquet") |>
  mutate(code = str_sub(activitePrincipaleEtablissement, 1L, 2L)) |>
  write_dataset("stock_ds")

# create a partitioned dataset with min_rows_per_group at 32000L
open_dataset("stocketablissement-utf8.parquet") |>
  mutate(code = str_sub(activitePrincipaleEtablissement, 1L, 2L)) |>
  write_dataset("stock_ds_32000", min_rows_per_group = 32000L)
```
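One way to see how the two writes differ on disk is to count the row groups in each Parquet file. The following is a minimal sketch and not part of the original report: `row_group_summary()` is a hypothetical helper name, and the loop assumes the datasets were written to the `stock_ds` and `stock_ds_32000` directories as above.

```r
library(arrow)

# Hypothetical helper: for every Parquet file under `path`, report the
# number of rows and the number of row groups it contains.
row_group_summary <- function(path) {
  files <- list.files(path, pattern = "\\.parquet$", recursive = TRUE,
                      full.names = TRUE)
  do.call(rbind, lapply(files, function(f) {
    reader <- ParquetFileReader$create(f)
    data.frame(
      file = basename(f),
      rows = reader$num_rows,
      row_groups = reader$num_row_groups
    )
  }))
}

# Only inspect the directories that actually exist.
for (ds in c("stock_ds", "stock_ds_32000")) {
  if (dir.exists(ds)) print(row_group_summary(ds))
}
```

Comparing `rows / row_groups` between the two directories gives the average row-group size each write produced.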
I measured the duration and memory usage of this query:
```
open_dataset(input) |>
  filter(dateCreationEtablissement < "2018-02-07" & dateDebut > "2020-02-02") |>
  count() |>
  collect()
```
The results are:
```
  source                          duration max_mem
1 stocketablissement-utf8.parquet     3.14    2.17
2 stock_ds                           15.9     6.54
3 stock_ds_32000                      3.38    2.83
```
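For anyone who wants to repeat the timing, here is a minimal sketch using base R's `system.time()`. It is an assumption about one way to obtain a `duration` column, not the method used for the table above; `time_query()` is a hypothetical helper name, and the sketch does not measure peak memory.

```r
library(arrow)
library(dplyr)

# Hypothetical helper: run the query against `input` and return the
# elapsed wall-clock time in seconds.
time_query <- function(input) {
  timing <- system.time(
    open_dataset(input) |>
      filter(dateCreationEtablissement < "2018-02-07" &
               dateDebut > "2020-02-02") |>
      count() |>
      collect()
  )
  unname(timing["elapsed"])
}

# Time only the sources that are present on disk.
sources <- c("stocketablissement-utf8.parquet", "stock_ds", "stock_ds_32000")
sources <- sources[file.exists(sources)]
data.frame(
  source = sources,
  duration = vapply(sources, time_query, numeric(1)),
  row.names = NULL
)
```

A single run can be noisy because of file-system caching, so repeating each measurement a few times gives a fairer comparison.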