[
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson updated ARROW-13865:
------------------------------------
Component/s: C++
> [C++][R] Writing moderate-size parquet files of nested dataframes from R
> slows down/process hangs
> -------------------------------------------------------------------------------------------------
>
> Key: ARROW-13865
> URL: https://issues.apache.org/jira/browse/ARROW-13865
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 5.0.0
> Reporter: John Sheffield
> Priority: Major
> Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png
>
>
> I observed a significant slowdown in Parquet writes (ultimately the process
> hangs for minutes without completing) while writing moderate-size nested
> data frames from R. I have reproduced the issue on macOS and Ubuntu so far.
>
> An example:
> ```
> testdf <- dplyr::tibble(
>   id = uuid::UUIDgenerate(n = 5000),
>   # lapply() already returns a list, so no as.list() wrapper is needed
>   l1 = lapply(1:5000, function(x) runif(1000)),
>   l2 = lapply(1:5000, function(x) rnorm(1000))
> )
> testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))
>
> # This works
> arrow::write_parquet(testdf_long, "testdf_long.parquet")
> # This write does not complete within a few minutes in my testing, but it
> # throws no errors
> arrow::write_parquet(testdf, "testdf.parquet")
> ```
> I don't know the cause, but the slowdown is closely tied to the row count:
> ```
> # Screenshot attached; 12 ms, 56 ms, and 680 ms for 1, 10, and 100 rows
> # respectively.
> microbenchmark::microbenchmark(
>   arrow::write_parquet(testdf[1, ], "testdf.parquet"),
>   arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
>   arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
>   times = 5
> )
> ```
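> To see how the write time grows with the row count, the same measurement
> could be repeated over a range of sizes. The following is only a sketch:
> `time_writes()` is a hypothetical helper (not part of arrow), and `nested`
> is a smaller base-R stand-in for `testdf`, assumed here so the snippet runs
> without dplyr or uuid.
>
> ```r
> # Hypothetical helper: time a single write for each requested row count.
> # `write_fn` is any function called as write_fn(data, path).
> time_writes <- function(df, sizes, write_fn) {
>   path <- tempfile(fileext = ".parquet")
>   on.exit(unlink(path), add = TRUE)
>   elapsed <- vapply(sizes, function(n) {
>     rows <- df[seq_len(n), , drop = FALSE]
>     unname(system.time(write_fn(rows, path))[["elapsed"]])
>   }, numeric(1))
>   data.frame(rows = sizes, seconds = elapsed)
> }
>
> # Stand-in nested data frame: one list column of 1000-element numeric vectors.
> nested <- data.frame(id = seq_len(200))
> nested$l1 <- lapply(seq_len(200), function(i) runif(1000))
>
> # Only run the Parquet timing when arrow is installed.
> if (requireNamespace("arrow", quietly = TRUE)) {
>   timings <- time_writes(nested, c(1, 10, 100, 200), arrow::write_parquet)
>   print(timings)
> }
> ```
>
> Each size is timed once, so the numbers are noisy; comparing the ratio of
> seconds to rows across sizes gives a rough idea of whether the cost grows
> faster than linearly.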
> I'm using the CRAN release of arrow 5.0.0 in both cases. The sessionInfo()
> for Ubuntu is:
> R version 4.0.5 (2021-03-31)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
> Matrix products: default
> BLAS/LAPACK:
> /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] arrow_5.0.0
> And the sessionInfo() for macOS is:
> R version 4.0.1 (2020-06-06)
> Platform: x86_64-apple-darwin17.0 (64-bit)
> Running under: macOS Catalina 10.15.7
> Matrix products: default
> BLAS:
> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
> LAPACK:
> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] arrow_5.0.0
--
This message was sent by Atlassian Jira
(v8.3.4#803005)