[jira] [Commented] (ARROW-16320) Dataset re-partitioning consumes considerable amount of memory

Weston Pace (Jira) Wed, 27 Apr 2022 21:05:05 -0700


    [ https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529181#comment-17529181 ]


Weston Pace commented on ARROW-16320:
-------------------------------------

The writing behavior you described seemed odd, so I modified your script a 
little (and added a memory print which, sadly, only works on Linux):

{noformat}
> 
> print_rss <- function() {
+   print(grep("vmrss", readLines("/proc/self/status"), ignore.case=TRUE, value=TRUE))
+ }
> 
> n = 99e6 + as.integer(1e6 * runif(n = 1))
> a = 
+   tibble(
+     key1 = sample(datasets::state.abb, size = n, replace = TRUE),
+     key2 = sample(datasets::state.name, size = n, replace = TRUE),
+     subkey1 = sample(LETTERS, size = n, replace = TRUE),
+     subkey2 = sample(letters, size = n, replace = TRUE),
+     value1 = runif(n = n),
+     value2 = as.integer(1000 * runif(n = n)),
+     time = as.POSIXct(1e8 * runif(n = n), tz = "UTC", origin = "2020-01-01")
+   ) |> 
+   mutate(
+     subkey1 = if_else(key1 %in% c("WA", "WV", "WI", "WY"), 
+                       subkey1, NA_character_),
+     subkey2 = if_else(key2 %in% c("Washington", "West Virginia", "Wisconsin", "Wyoming"), 
+                       subkey2, NA_character_),
+   )
> lobstr::obj_size(a)
5,171,792,240 B
> print("Memory usage after creating the tibble")
[1] "Memory usage after creating the tibble"
> print_rss()
[1] "VmRSS:\t 5159276 kB"
> 
> 
> readr::write_rds(a, here::here("db", "test100m.rds"))
> print("Memory usage after writing rds")
[1] "Memory usage after writing rds"
> print_rss()
[1] "VmRSS:\t 5161776 kB"
> 
> 
> arrow::write_parquet(a, here::here("db", "test100m.parquet"))
> print("Memory usage after writing parquet")
[1] "Memory usage after writing parquet"
> print_rss()
[1] "VmRSS:\t 8990620 kB"
> Sys.sleep(5)
> print("And after sleeping 5 seconds")
[1] "And after sleeping 5 seconds"
> print_rss()
[1] "VmRSS:\t 8990620 kB"
> print(gc())
            used   (Mb) gc trigger    (Mb)   max used   (Mb)
Ncells    892040   47.7    1749524    93.5    1265150   67.6
Vcells 647980229 4943.7 1392905158 10627.1 1240800333 9466.6
> Sys.sleep(5)
> print("And again after a garbage collection and 5 more seconds")
[1] "And again after a garbage collection and 5 more seconds"
> print_rss()
[1] "VmRSS:\t 5377900 kB"
{noformat}

Summarizing...
{noformat}
Create table
~5.15GB RAM used
Write RDS
~5.16GB RAM used
Write Parquet
~9GB RAM used
Wait 5 seconds
~9GB RAM used
Run garbage collection
Wait 5 seconds
~5.38GB RAM used
{noformat}

This doesn't seem ideal.  I suspect that, after the write, some R objects still 
hold references (possibly transitively) to shared pointers to record batches in 
C++.  When the garbage collector runs, those R objects are destroyed and the 
shared pointers (and their buffers) can be freed.
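
If that explanation holds, forcing a collection right after the write should 
bring the RSS back down.  A minimal sketch for checking this, assuming Linux 
(it reads /proc/self/status, as the script above does); rss_kb() and 
report_write() are hypothetical helper names, not part of arrow:

```r
# Resident set size of the current process in kB (Linux only).
# Returns NA_real_ when /proc/self/status is not available.
rss_kb <- function() {
  path <- "/proc/self/status"
  if (!file.exists(path)) return(NA_real_)
  line <- grep("^VmRSS:", readLines(path), value = TRUE)
  if (length(line) == 0) return(NA_real_)
  as.numeric(gsub("[^0-9]", "", line[1]))
}

# Hypothetical helper: run write_fn(), then force a garbage collection,
# and return the RSS measured before, after the write, and after gc().
report_write <- function(write_fn) {
  before <- rss_kb()
  write_fn()
  after_write <- rss_kb()
  invisible(gc())
  after_gc <- rss_kb()
  c(before = before, after_write = after_write, after_gc = after_gc)
}

# Example usage (assumes the tibble `a` and the path from the script above):
# report_write(function() {
#   arrow::write_parquet(a, here::here("db", "test100m.parquet"))
# })
```

A large gap between after_write and after_gc would be consistent with the 
memory being held by R objects that had not been collected yet.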



> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
>                 Key: ARROW-16320
>                 URL: https://issues.apache.org/jira/browse/ARROW-16320
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>         Attachments: 100m_1_create.jpg, 100m_2_rds.jpg, 100m_3_parquet.jpg, 
> 100m_4_read_rds.jpg, 100m_5_read-parquet.jpg, Rgui_mem.jpg, Rstudio_env.jpg, 
> Rstudio_mem.jpg
>
>
> A short background: I was trying to create a dataset from a big pile of csv 
> files (a couple of hundred). In the first step the csv files were parsed and 
> saved to parquet files because there were many inconsistencies between them. 
> In a subsequent step the dataset was re-partitioned using one column (code_key).
>  
> {code:java}
> new_dataset <- open_dataset(
>   temp_parquet_folder, 
>   format = "parquet",
>   unify_schemas = TRUE
>   )
> new_dataset |> 
>   group_by(code_key) |> 
>   write_dataset(
>     folder_repartitioned_dataset, 
>     format = "parquet"
>   )
> {code}
>  
> This re-partitioning consumed a considerable amount of memory (5 GB). 
>  * Is this a normal behavior?  Or a bug?
>  * Is there any rule of thumb to estimate the memory requirement for a 
> dataset re-partitioning? (it’s important when scaling up this approach)
> The drawback is that this memory is not freed after the re-partitioning 
> (I am using RStudio). 
> {{gc()}} is useless in this situation, and there is no object associated 
> with the re-partitioning in the {{R}} environment that could be removed from 
> memory (using the {{rm()}} function).
>  * How can one regain the memory used by re-partitioning?
> The rationale behind choosing dataset re-partitioning: if my understanding 
> is correct, in the current arrow version appending does not work when 
> writing parquet files/datasets. (The original csv files were partly 
> partitioned according to a different variable.)
> Can you recommend any better approach?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
