[ 
https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528712#comment-17528712
 ] 

Zsolt Kegyes-Brassai commented on ARROW-16320:
----------------------------------------------

And here is the result of reading back these files.

 
{code:java}
a = readr::read_rds(here::here("db", "test100m.rds"))
lobstr::obj_size(a)
#> 5,177,583,640 B{code}
 

!100m_4_read_rds.jpg!

 
{code:java}
a = arrow::read_parquet(here::here("db", "test100m.parquet"))
lobstr::obj_size(a)
#> 796,553,696 B{code}
 

!100m_5_read-parquet.jpg!

This time there is no considerable difference in memory utilization.
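
As a side note on what these numbers measure: a minimal sketch, assuming the
same test100m.parquet file, of reading with as_data_frame = FALSE, which keeps
the data as an Arrow Table in Arrow's own allocator rather than the R heap, so
lobstr::obj_size() only sees the thin R6 handle.

{code:java}
# Sketch: keep the data as an Arrow Table instead of materialising an R tibble.
# The buffers then live in Arrow's allocator, not the R heap, so obj_size()
# reports only the small R wrapper object, not the data itself.
tbl <- arrow::read_parquet(
  here::here("db", "test100m.parquet"),
  as_data_frame = FALSE
)
class(tbl)             # "Table" "ArrowTabular" "ArrowObject" "R6"
lobstr::obj_size(tbl)  # small (KBs, not GBs): the wrapper, not the buffers
{code}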

It’s a bit hard for me to understand when additional memory is used for parquet
activities and, more importantly, when that memory is returned and when it is
not (and what can trigger it).
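
One way to separate the two sides, assuming the installed arrow version exposes
default_memory_pool() (it is part of the arrow R package's documented API), is
to watch Arrow's own allocator alongside gc(), since Arrow's buffers are not
counted or freed by R's garbage collector:

{code:java}
# Sketch: Arrow allocates through its own memory pool, which R's gc() neither
# tracks nor frees. Checking the pool before and after a dataset operation
# shows how much memory Arrow itself is currently holding.
pool <- arrow::default_memory_pool()
pool$backend_name     # e.g. "jemalloc" or "mimalloc"
pool$bytes_allocated  # bytes Arrow currently holds
pool$max_memory       # high-water mark for this session
{code}

If the pool is backed by jemalloc or mimalloc, memory that Arrow has already
freed may be retained by the allocator and handed back to the operating system
only lazily, which could explain a process footprint that stays high even after
the corresponding R objects are gone.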

Sorry, I am a bit puzzled. It may well be that this is not a bug, just a gap
in my understanding.

> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
>                 Key: ARROW-16320
>                 URL: https://issues.apache.org/jira/browse/ARROW-16320
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>         Attachments: 100m_1_create.jpg, 100m_2_rds.jpg, 100m_3_parquet.jpg, 
> 100m_4_read_rds.jpg, 100m_5_read-parquet.jpg, Rgui_mem.jpg, Rstudio_env.jpg, 
> Rstudio_mem.jpg
>
>
> A short background: I was trying to create a dataset from a big pile of csv 
> files (a couple of hundred). In a first step, the csv files were parsed and 
> saved to parquet files, because there were many inconsistencies between the 
> csv files. In a subsequent step, the dataset was re-partitioned using one 
> column (code_key).
>  
> {code:java}
> new_dataset <- open_dataset(
>   temp_parquet_folder, 
>   format = "parquet",
>   unify_schemas = TRUE
>   )
> new_dataset |> 
>   group_by(code_key) |> 
>   write_dataset(
>     folder_repartitioned_dataset, 
>     format = "parquet"
>   )
> {code}
>  
> This re-partitioning consumed a considerable amount of memory (5 GB). 
>  * Is this normal behavior, or a bug?
>  * Is there any rule of thumb for estimating the memory requirement of a 
> dataset re-partitioning? (This matters when scaling up this approach.)
> The drawback is that this memory is not freed up after the re-partitioning 
> (I am using RStudio). 
> {{gc()}} is useless in this situation, and there is no object associated 
> with the re-partitioning in the {{R}} environment that could be removed from 
> memory (using the {{rm()}} function).
>  * How can one regain the memory used by the re-partitioning?
> The rationale behind choosing dataset re-partitioning: if my understanding 
> is correct, in the current arrow version, appending does not work when 
> writing parquet files/datasets (the original csv files were partly 
> partitioned according to a different variable).
> Can you recommend a better approach?
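
For the re-partitioning snippet quoted above, newer arrow releases expose a few
write_dataset() arguments that bound how much data is buffered per output file;
a hedged sketch (the argument set and defaults depend on the installed arrow
version, and the numbers below are only placeholders):

{code:java}
# Sketch: cap write-side buffering during the re-partition.
# max_open_files limits how many partition files are open (and buffered) at
# once; max_rows_per_file splits large partitions into several files; and
# max_rows_per_group bounds how many rows are accumulated per row group before
# they are flushed to disk.
library(arrow)
library(dplyr)

open_dataset(temp_parquet_folder, format = "parquet", unify_schemas = TRUE) |>
  group_by(code_key) |>
  write_dataset(
    folder_repartitioned_dataset,
    format = "parquet",
    max_open_files = 512,         # placeholder: fewer simultaneously open files
    max_rows_per_file = 5000000,  # placeholder: split very large partitions
    max_rows_per_group = 250000   # placeholder: smaller row groups, smaller buffers
  )
{code}

Whether this lowers the peak in practice depends mostly on how many distinct
code_key values (and therefore open output files) exist at the same time.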



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
