nbc commented on issue #41057:
URL: https://github.com/apache/arrow/issues/41057#issuecomment-2047156376
I store on the local filesystem. I thought it was related to R arrow, but it's not: I've just tried the same partitioning with Python and it gives the same result, a very low mean row group size (around 730).

Source file: https://static.data.gouv.fr/resources/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/20231214-130548/stocketablissement-utf8.parquet

```
import pyarrow.dataset as ds
import pyarrow as pa

ds.write_dataset(
    ds.dataset("stocketablissement-utf8.parquet"),
    base_dir='stock_py_32',
    format='parquet',
    partitioning=ds.partitioning(
        schema=pa.schema([('typeVoieEtablissement', pa.utf8())]),
        flavor='hive'
    )
)
```

```
> parquet_metadata("stocketablissement-utf8.parquet") |> summarise(mean = mean(row_group_num_rows), min = min(row_group_num_rows), max = max(row_group_num_rows))
# A tibble: 1 × 3
     mean   min    max
    <dbl> <dbl>  <dbl>
1 123753. 79292 124927

> parquet_metadata("stock_py") |> summarise(mean = mean(row_group_num_rows), min = min(row_group_num_rows), max = max(row_group_num_rows))
# A tibble: 1 × 3
   mean   min   max
  <dbl> <dbl> <dbl>
1  727.     1 22805
```
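As a possible workaround, here is a sketch (not tested against this exact file, and assuming the installed pyarrow version exposes the `min_rows_per_group` / `max_rows_per_group` arguments of `write_dataset`; the output directory name is just an example) that asks the writer to accumulate larger row groups instead of flushing many small ones:

```
import pyarrow.dataset as ds
import pyarrow as pa

# Same partitioned write as above, but requesting larger row groups.
# min_rows_per_group / max_rows_per_group are assumed to be available
# in the installed pyarrow version.
ds.write_dataset(
    ds.dataset("stocketablissement-utf8.parquet"),
    base_dir='stock_py_grouped',      # example output directory
    format='parquet',
    partitioning=ds.partitioning(
        schema=pa.schema([('typeVoieEtablissement', pa.utf8())]),
        flavor='hive'
    ),
    min_rows_per_group=100_000,       # buffer rows until at least this many per group
    max_rows_per_group=1_000_000      # upper bound on rows per group
)
```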
