nbc commented on issue #41057:
URL: https://github.com/apache/arrow/issues/41057#issuecomment-2047156376

   I store on a local filesystem.
   
   I thought it was related to R arrow, but it's not. I've just tried the same partitioning with Python and it gave the same result: a very low mean row group size (around 730):
   
   Source file: https://static.data.gouv.fr/resources/base-sirene-des-entreprises-et-de-leurs-etablissements-siren-siret/20231214-130548/stocketablissement-utf8.parquet
   
   ```
   import pyarrow.dataset as ds
   import pyarrow as pa

   # Repartition the source file hive-style on typeVoieEtablissement.
   ds.write_dataset(
       ds.dataset("stocketablissement-utf8.parquet"),
       base_dir="stock_py_32",
       format="parquet",
       partitioning=ds.partitioning(
           schema=pa.schema([("typeVoieEtablissement", pa.utf8())]),
           flavor="hive",
       ),
   )
   ```
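
   (Not part of the run measured below, just a sketch of what I'd try next: `write_dataset` also takes `min_rows_per_group` / `max_rows_per_group`, so setting them explicitly should, as far as I understand, keep the row groups large. The output directory name here is illustrative.)

   ```
   import pyarrow.dataset as ds
   import pyarrow as pa

   # Sketch only (not the run measured below): ask the writer to buffer
   # rows until each row group holds at least 64k rows, capped at 1M.
   ds.write_dataset(
       ds.dataset("stocketablissement-utf8.parquet"),
       base_dir="stock_py_grouped",  # illustrative output directory
       format="parquet",
       partitioning=ds.partitioning(
           schema=pa.schema([("typeVoieEtablissement", pa.utf8())]),
           flavor="hive",
       ),
       min_rows_per_group=64 * 1024,
       max_rows_per_group=1024 * 1024,
   )
   ```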
   
   ```
   > parquet_metadata("stocketablissement-utf8.parquet") |> summarise(mean = 
mean(row_group_num_rows), min = min(row_group_num_rows), max = 
max(row_group_num_rows))
   # A tibble: 1 × 3
        mean   min    max
       <dbl> <dbl>  <dbl>
   1 123753. 79292 124927
   
   > parquet_metadata("stock_py") |> summarise(mean = mean(row_group_num_rows), 
min = min(row_group_num_rows), max = max(row_group_num_rows))
   # A tibble: 1 × 3
      mean   min   max
     <dbl> <dbl> <dbl>
   1  727.     1 22805
   ```
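
   (The `parquet_metadata()` call above is an R-side helper; a rough Python sketch of the same check over the partitioned output, using the `stock_py_32` directory written above, would be something like this.)

   ```
   import glob
   import pyarrow.parquet as pq

   # Rough Python equivalent of the row-group check above: gather the
   # num_rows of every row group in every file under the partitioned output.
   sizes = []
   for path in glob.glob("stock_py_32/**/*.parquet", recursive=True):
       md = pq.ParquetFile(path).metadata
       sizes.extend(md.row_group(i).num_rows for i in range(md.num_row_groups))

   print(sum(sizes) / len(sizes), min(sizes), max(sizes))
   ```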

