[ https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Sheffield updated ARROW-13865:
-----------------------------------
Description:
I observed a significant slowdown in Parquet writes when writing moderate-size
nested data frames from R; eventually the process hangs for minutes without
completing. I have reproduced the issue on macOS and Ubuntu so far.
An example:
```
# 5000 rows; l1 and l2 are list columns holding 1000 doubles per row
testdf <- dplyr::tibble(
  id = uuid::UUIDgenerate(n = 5000),
  l1 = lapply(1:5000, function(x) runif(1000)),
  l2 = lapply(1:5000, function(x) rnorm(1000))
)
# Flat equivalent: one row per list element (5,000,000 rows)
testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))
# This works
arrow::write_parquet(testdf_long, "testdf_long.parquet")
# This write does not complete within a few minutes in my testing, but throws
# no errors
arrow::write_parquet(testdf, "testdf.parquet")
```
I can't tell why this happens, but the slowdown is closely tied to row count:
```
# Writing 1, 10, and 100 rows took 12ms, 56ms, and 680ms respectively
# (screenshot attached).
microbenchmark::microbenchmark(
  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
  times = 5
)
```
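To check how write time grows beyond 100 rows, something like the following sketch might help. It is not part of the original report: `time_writes` is a hypothetical helper, the nested frame is deliberately smaller than `testdf` so the run finishes quickly, and a single `system.time()` call per size is only a rough measurement.

```r
# Hypothetical helper (not from the report): time one write per subset size
# and return the elapsed seconds alongside the row counts.
time_writes <- function(df, sizes, write_fn) {
  path <- tempfile(fileext = ".parquet")
  on.exit(unlink(path), add = TRUE)
  seconds <- vapply(sizes, function(n) {
    rows <- df[seq_len(n), , drop = FALSE]
    system.time(write_fn(rows, path))[["elapsed"]]
  }, numeric(1))
  data.frame(rows = sizes, seconds = seconds)
}

if (requireNamespace("arrow", quietly = TRUE)) {
  # Assumption: a 400-row frame with one list column of 1000 doubles per row
  # is enough to show the trend without hanging.
  nested <- data.frame(id = seq_len(400))
  nested$l1 <- lapply(seq_len(400), function(i) runif(1000))
  timings <- time_writes(nested, c(1, 10, 100, 200, 400), arrow::write_parquet)
  # If seconds per row keeps rising as rows grow, the cost is super-linear.
  timings$seconds_per_row <- timings$seconds / timings$rows
  print(timings)
}
```

Passing the writer as `write_fn` keeps the helper usable with other writers for comparison, for example `saveRDS` or a flat (unnested) frame.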
I'm using the CRAN release of arrow, version 5.0.0, on both systems. The sessionInfo() for Ubuntu is:
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_5.0.0
And sessionInfo() for macOS is:
R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] arrow_5.0.0
> Writing moderate-size parquet files of nested dataframes from R slows
> down/process hangs
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-13865
> URL: https://issues.apache.org/jira/browse/ARROW-13865
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 5.0.0
> Reporter: John Sheffield
> Priority: Major
> Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png
>
>