[
https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528661#comment-17528661
]
Zsolt Kegyes-Brassai commented on ARROW-16320:
----------------------------------------------
Hi [~westonpace]
I tried to create a reproducible example.
In the first step I created a dummy dataset with nearly 100 M rows, having
different column types and missing data.
When writing this dataset to a parquet file, I realized that even
{{write_parquet()}} consumes a large amount of memory which is not returned
afterwards.
Here is the data generation part:
{code:java}
library(tidyverse)

n = 99e6 + as.integer(1e6 * runif(n = 1))
# n = 1000
a =
  tibble(
    key1 = sample(datasets::state.abb, size = n, replace = TRUE),
    key2 = sample(datasets::state.name, size = n, replace = TRUE),
    subkey1 = sample(LETTERS, size = n, replace = TRUE),
    subkey2 = sample(letters, size = n, replace = TRUE),
    value1 = runif(n = n),
    value2 = as.integer(1000 * runif(n = n)),
    time = as.POSIXct(1e8 * runif(n = n), tz = "UTC", origin = "2020-01-01")
  ) |>
  mutate(
    subkey1 = if_else(key1 %in% c("WA", "WV", "WI", "WY"),
                      subkey1, NA_character_),
    subkey2 = if_else(key2 %in% c("Washington", "West Virginia",
                                  "Wisconsin", "Wyoming"),
                      subkey2, NA_character_)
  )

lobstr::obj_size(a)
#> 5,177,583,640 B
{code}
And here is the memory utilization after the dataset creation:
!100m_1_create.jpg!
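For reference, here is a minimal sketch of how these numbers could be captured
in-session rather than from Task Manager screenshots. It assumes the installed
arrow version exposes {{default_memory_pool()}}; the helper name is mine:
{code:java}
# hypothetical helper: report R-managed memory vs Arrow's own memory pool
mem_snapshot <- function(label) {
  gc(full = TRUE)  # let R release whatever it can first
  cat(label,
      "| R heap:", as.numeric(lobstr::mem_used()), "B",
      "| Arrow pool:", arrow::default_memory_pool()$bytes_allocated, "B\n")
}

mem_snapshot("after dataset creation")
{code}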
Writing to an *{{rds}}* file
{code:java}
readr::write_rds(a, here::here("db", "test100m.rds"))
{code}
causes no visible increase in memory utilization:
!100m_2_rds.jpg!
Writing to a *parquet* file
{code:java}
arrow::write_parquet(a, here::here("db", "test100m.parquet"))
{code}
causes a drastic increase in memory utilization, from 10.6 GB to 15 GB, just
for writing the file:
!100m_3_parquet.jpg!
+The memory consumed while writing the parquet file was not released even
after 15 minutes.+
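One thing that may be worth testing, assuming the retention comes from the
allocator rather than a leak: Arrow's default jemalloc/mimalloc allocator can
hold on to freed memory instead of returning it to the OS right away. Setting
the {{ARROW_DEFAULT_MEMORY_POOL}} environment variable before the package is
loaded should switch to the system allocator (a sketch, to be run in a fresh
session):
{code:java}
# must be set before arrow is loaded for the first time
Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")
library(arrow)
default_memory_pool()$backend_name  # expected to report "system"
{code}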
My biggest concern is that the ability to handle datasets larger than the
available memory seems increasingly remote.
I consider this a critical bug, but it may be affecting only me, as I have no
possibility to test elsewhere.
> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
> Key: ARROW-16320
> URL: https://issues.apache.org/jira/browse/ARROW-16320
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 7.0.0
> Reporter: Zsolt Kegyes-Brassai
> Priority: Minor
> Attachments: 100m_1_create.jpg, 100m_2_rds.jpg, 100m_3_parquet.jpg,
> Rgui_mem.jpg, Rstudio_env.jpg, Rstudio_mem.jpg
>
>
> A short background: I was trying to create a dataset from a big pile of csv
> files (a couple of hundred). In a first step, the csv files were parsed and
> saved to parquet files, because there were many inconsistencies between the
> csv files. In a subsequent step, the dataset was re-partitioned using one
> column (code_key).
>
> {code:java}
> new_dataset <- open_dataset(
>   temp_parquet_folder,
>   format = "parquet",
>   unify_schemas = TRUE
> )
>
> new_dataset |>
>   group_by(code_key) |>
>   write_dataset(
>     folder_repartitioned_dataset,
>     format = "parquet"
>   )
> {code}
>
> This re-partitioning consumed a considerable amount of memory (5 GB).
> * Is this normal behavior, or a bug?
> * Is there any rule of thumb to estimate the memory requirement for a
> dataset re-partitioning? (this is important when scaling up this approach)
> The drawback is that this memory space is not freed up after the
> re-partitioning (I am using RStudio).
> {{gc()}} is useless in this situation, and there is no object associated
> with the re-partitioning in the {{R}} environment that could be removed from
> memory (using the {{rm()}} function).
> * How can one regain the memory space used by the re-partitioning?
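> A sketch that may help confirm where the memory sits (assuming the arrow
> package exposes {{default_memory_pool()}}): {{gc()}} only collects R
> objects, while Arrow allocates from its own C++ pool, which R's garbage
> collector does not manage.
> {code:java}
> rm(list = ls())
> gc(full = TRUE)                               # frees R-managed memory only
> arrow::default_memory_pool()$bytes_allocated  # Arrow's own pool; if this is
>                                               # still large, the memory is
>                                               # held on the C++ side
> {code}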
> The rationale behind choosing dataset re-partitioning: if my understanding
> is correct, in the current arrow version appending is not supported when
> writing parquet files/datasets (the original csv files were partly
> partitioned according to a different variable).
> Can you recommend any better approach?
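> One possible workaround, as a hypothetical sketch rather than a confirmed
> fix: process one partition key at a time, so that only one key's rows are
> materialized at once, writing each key into its own hive-style subfolder.
> {code:java}
> library(arrow)
> library(dplyr)
>
> # collect only the partition column to enumerate the keys
> keys <- new_dataset |>
>   select(code_key) |>
>   collect() |>
>   pull(code_key) |>
>   unique()
>
> # write each key's rows separately; "code_key=<k>" folders make the
> # result readable as a hive-partitioned dataset
> for (k in keys) {
>   new_dataset |>
>     filter(code_key == k) |>
>     write_dataset(
>       file.path(folder_repartitioned_dataset, paste0("code_key=", k)),
>       format = "parquet"
>     )
> }
> {code}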