Zsolt Kegyes-Brassai created ARROW-16320:
--------------------------------------------

             Summary: Dataset re-partitioning consumes a considerable amount of memory
                 Key: ARROW-16320
                 URL: https://issues.apache.org/jira/browse/ARROW-16320
             Project: Apache Arrow
          Issue Type: Improvement
    Affects Versions: 7.0.0
            Reporter: Zsolt Kegyes-Brassai


A short background: I was trying to create a dataset from a large pile of CSV 
files (a couple of hundred). In a first step, the CSV files were parsed and 
saved to Parquet files because there were many inconsistencies between them. In 
a subsequent step, the dataset was re-partitioned using one column (code_key).

 
{code:r}
library(arrow)
library(dplyr)

# Open the intermediate Parquet files as a single dataset, reconciling schemas
new_dataset <- open_dataset(
  temp_parquet_folder, 
  format = "parquet",
  unify_schemas = TRUE
  )

# Re-partition by code_key: one subdirectory per distinct value
new_dataset |> 
  group_by(code_key) |> 
  write_dataset(
    folder_repartitioned_dataset, 
    format = "parquet"
  )
{code}
 

This re-partitioning consumed a considerable amount of memory (5 GB).
 * Is this normal behavior, or a bug?
 * Is there any rule of thumb for estimating the memory requirement of a 
dataset re-partitioning? It is important when scaling up this approach (see the 
sketch after this list for the knobs I would try).
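
For what it's worth, below is a sketch of the knobs I would try first to bound 
the writer's working set. The {{max_rows_per_group}} / {{max_open_files}} 
arguments and {{set_cpu_count()}} are assumptions on my side (based on 
{{?write_dataset}} in arrow 7.0.0) and may be missing or named differently in 
other versions.

{code:r}
library(arrow)
library(dplyr)

# Fewer CPU threads -> fewer record batches held in flight at the same time
set_cpu_count(2)

open_dataset(temp_parquet_folder, format = "parquet", unify_schemas = TRUE) |> 
  group_by(code_key) |> 
  write_dataset(
    folder_repartitioned_dataset,
    format = "parquet",
    max_rows_per_group = 100000L,  # smaller row groups -> smaller write buffers
    max_open_files = 64            # cap on concurrently open partition files
  )
{code}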

A drawback is that this memory is not freed up after the re-partitioning (I am 
using RStudio). {{gc()}} is useless in this situation, and there is no object 
associated with the re-partitioning in the {{R}} environment that could be 
removed from memory with {{rm()}}.
 * How can one reclaim the memory used by the re-partitioning? (A sketch of how 
I am checking Arrow's memory pool follows below.)
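
To clarify what I mean by "not freed up": I am checking Arrow's own memory pool 
rather than the R heap, roughly as sketched below. {{default_memory_pool()}} 
and its {{bytes_allocated}} field are what I believe the R package exposes for 
this; please correct me if there is a better way.

{code:r}
library(arrow)

pool <- default_memory_pool()
pool$backend_name     # e.g. "jemalloc" or "mimalloc", depending on the build
pool$bytes_allocated  # bytes currently held by Arrow's C++ layer

gc()                  # releases R objects only; it does not touch Arrow's pool
pool$bytes_allocated  # typically unchanged after gc()
{code}

If {{bytes_allocated}} were low while the OS still reported a large resident 
size, I would assume the memory is retained by the allocator rather than 
leaked; as far as I understand, setting 
{{Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")}} before loading arrow is 
the documented way to try a different allocator.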

The rationale behind choosing dataset re-partitioning: if my understanding is 
correct, appending is not supported when writing Parquet files/datasets in the 
current arrow version. (The original CSV files were partly partitioned by a 
different variable.)

Can you recommend a better approach?
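
One workaround I have been considering is to process the intermediate Parquet 
files in smaller batches and write each batch into the same partitioned 
directory, so that only one batch is materialized at a time. The sketch below 
assumes that a distinct {{basename_template}} per pass is enough to avoid 
filename collisions (and that the default {{existing_data_behavior}} keeps 
non-colliding files); please correct me if this is not a supported pattern.

{code:r}
library(arrow)
library(dplyr)

parquet_files <- list.files(temp_parquet_folder, pattern = "\\.parquet$",
                            full.names = TRUE)

# Roughly 50 files per pass, so only one batch is held in memory at once
batches <- split(parquet_files, ceiling(seq_along(parquet_files) / 50))

for (i in seq_along(batches)) {
  open_dataset(batches[[i]], format = "parquet", unify_schemas = TRUE) |> 
    group_by(code_key) |> 
    write_dataset(
      folder_repartitioned_dataset,
      format = "parquet",
      # distinct file name template per batch, so earlier output is kept
      basename_template = sprintf("batch-%03d-part-{i}.parquet", i)
    )
}
{code}

The trade-off is more (and smaller) files per partition, which may call for a 
later compaction step.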



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
