[
https://issues.apache.org/jira/browse/ARROW-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416718#comment-17416718
]
Jonathan Keane commented on ARROW-14020:
----------------------------------------
Thanks for the report!
I did a bit of profiling, and I think I see where this is slowing down: when
presented with a list column, {arrow} inspects the attributes of the list
column (and specifically of each element in the column) in order to save them
as metadata, even though these list columns don't have any additional
attributes. [1] We've already disabled this metadata when
interacting with datasets (ARROW-13189), and it's possible we should take out
this saving entirely, though we would probably still need to provide an option
for doing it, since people might depend on being able to save or read that
metadata. Alternatively, we could streamline the process to make it faster.
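To illustrate why per-element inspection gets expensive, here is a minimal base R sketch. The function `inspect_attributes()` is a hypothetical stand-in written for this example, not the code {arrow} actually runs; it only mimics the idea of visiting every element of a list column to look for attributes.

```r
# Hypothetical stand-in for per-element attribute inspection.
# This is NOT arrow's implementation, only an illustration of the cost.
inspect_attributes <- function(x) {
  visited <- 1L                     # this object was inspected
  found <- !is.null(attributes(x))  # does it carry anything worth saving?
  if (is.list(x)) {
    # Recurse into every element, as a per-element metadata walk would
    for (el in x) {
      res <- inspect_attributes(el)
      visited <- visited + res$visited
      found <- found || res$found
    }
  }
  list(visited = visited, found = found)
}

# Smaller versions of the columns from the report
flat   <- rep(list(seq(6)), 1e4)        # like `points` in `dummy`
nested <- rep(list(list(seq(6))), 1e4)  # one extra level, like `dummy2`

flat_res   <- inspect_attributes(flat)
nested_res <- inspect_attributes(nested)

flat_res$visited    # 10001 objects inspected
nested_res$visited  # 20001 objects inspected
flat_res$found      # FALSE: no attributes to save
```

In this sketch every extra level of nesting adds another R-level visit per row, and none of the visits finds an attribute, so all of the work is overhead. The real slowdown in {arrow} may have additional causes beyond what this toy walk shows.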
On the downstream impacts on SF: we've done a little exploration on
making the experience of saving SF columns better (e.g. ARROW-12542), though we
haven't gotten something perfect just yet. For now, if you're saving large
amounts of SF data and running into issues like this, we recommend checking
out the [{sfarrow}|https://github.com/wcjochem/sfarrow] package, which is
similar to the workaround you proposed: it uses well-known binary for
encoding the SF column(s) and adds some utility/helper functions.
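As a rough sketch of what that could look like, assuming the {sf} and {sfarrow} packages are installed and that `st_write_parquet()` and `st_read_parquet()` behave as described in the {sfarrow} README (the `nc` example data ships with {sf}; the temporary file path is just for illustration):

```r
library(sf)
library(sfarrow)

# Example polygons shipped with {sf}; any sf data frame should work here
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Write the sf object to Parquet; {sfarrow} encodes the geometry column
# as well-known binary rather than as a nested R list column
path <- tempfile(fileext = ".parquet")
st_write_parquet(obj = nc, dsn = path)

# Read it back as an sf object
nc_back <- st_read_parquet(path)
```

Whether this is fast enough for data the size of `sa12016` has not been measured here, so it is worth timing on your own data.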
[1] - Profiling showed that
https://github.com/apache/arrow/blob/b599a0539fffd1bb226ebce83e2f035d3080ac41/r/R/metadata.R#L160
takes a large share of the time spent converting to a table, and inside of that
https://github.com/apache/arrow/blob/b599a0539fffd1bb226ebce83e2f035d3080ac41/r/R/metadata.R#L129
takes between a quarter and half of that time.
> [R] Writing datafames with list columns is slow and scales poorly with
> nesting level
> ------------------------------------------------------------------------------------
>
> Key: ARROW-14020
> URL: https://issues.apache.org/jira/browse/ARROW-14020
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 5.0.0
> Environment: Windows 10 x64
> Reporter: Miles McBain
> Priority: Major
>
> Writing data frames that contain list columns seems much slower than expected:
> ``` r
> library(tidyverse)
> #> Warning: package 'tidyverse' was built under R version 4.1.1
> #> Warning: package 'tibble' was built under R version 4.1.1
> #> Warning: package 'readr' was built under R version 4.1.1
> library(arrow)
> #> Warning: package 'arrow' was built under R version 4.1.1
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #> timestamp
> dummy <- tibble(
> points = rep(list(seq(6)), 2e6),
> index = seq(2e6)
> )
> # very slooooooow
> system.time(write_parquet(dummy, "dummy.parquet"))
> #> user system elapsed
> #> 55.64 0.11 55.98
> dummy_txt <- mutate(dummy, points = map_chr(points, deparse))
> # orders of magnitude faster
> system.time(write_parquet(dummy_txt, "dummytext.parquet"))
> #> user system elapsed
> #> 0.24 0.02 0.25
> ```
> <sup>Created on 2021-09-17 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
> <details style="margin-bottom:10px;">
> <summary>Session info</summary>
> ``` r
> sessioninfo::session_info()
> #> - Session info ---------------------------------------------------------------
> #> setting value
> #> version R version 4.1.0 (2021-05-18)
> #> os Windows 10 x64
> #> system x86_64, mingw32
> #> ui RTerm
> #> language (EN)
> #> collate English_Australia.1252
> #> ctype English_Australia.1252
> #> tz Australia/Brisbane
> #> date 2021-09-17
> #>
> #> - Packages -------------------------------------------------------------------
> #> package * version date lib source
> #> arrow * 5.0.0.2 2021-09-05 [1] CRAN (R 4.1.1)
> #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
> #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
> #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
> #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
> #> broom 0.7.7 2021-06-13 [1] CRAN (R 4.1.0)
> #> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
> #> cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
> #> colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
> #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
> #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
> #> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
> #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
> #> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
> #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
> #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
> #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
> #> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
> #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
> #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
> #> ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
> #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
> #> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
> #> haven 2.4.1 2021-04-23 [1] CRAN (R 4.1.0)
> #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
> #> hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
> #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
> #> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
> #> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
> #> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
> #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
> #> lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
> #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
> #> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
> #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
> #> pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0)
> #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
> #> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
> #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
> #> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
> #> readr * 2.0.1 2021-08-10 [1] CRAN (R 4.1.1)
> #> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
> #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
> #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
> #> rmarkdown 2.9 2021-06-15 [1] CRAN (R 4.1.0)
> #> rvest 1.0.1 2021-07-26 [1] CRAN (R 4.1.0)
> #> scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
> #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
> #> stringi 1.7.4 2021-08-25 [1] CRAN (R 4.1.1)
> #> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
> #> styler 1.4.1 2021-03-30 [1] CRAN (R 4.1.0)
> #> tibble * 3.1.4 2021-08-25 [1] CRAN (R 4.1.1)
> #> tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.1.0)
> #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
> #> tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.1)
> #> tzdb 0.1.2 2021-07-20 [1] CRAN (R 4.1.0)
> #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
> #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
> #> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
> #> xfun 0.24 2021-06-15 [1] CRAN (R 4.1.0)
> #> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
> #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
> #>
> #> [1] C:/Users/msmcbain/libs/R
> #> [2] C:/R/R-4.1.0/library
> ```
> </details>
> In this case it's actually faster to convert the list columns to text and
> then write them than to write the list columns directly.
> This issue also affects write_arrow:
> ``` r
> library(tidyverse)
> #> Warning: package 'tidyverse' was built under R version 4.1.1
> #> Warning: package 'tibble' was built under R version 4.1.1
> #> Warning: package 'readr' was built under R version 4.1.1
> library(arrow)
> #> Warning: package 'arrow' was built under R version 4.1.1
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #> timestamp
> dummy <- tibble(
> points = rep(list(seq(6)), 2e6),
> index = seq(2e6)
> )
> # very slooooooow
> system.time(write_arrow(dummy, "dummy.parquet"))
> #> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
> #> user system elapsed
> #> 56.95 0.08 57.13
> dummy_txt <- mutate(dummy, points = map_chr(points, deparse))
> # orders of magnitude faster
> system.time(write_arrow(dummy_txt, "dummytext.parquet"))
> #> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
> #> user system elapsed
> #> 0.06 0.01 0.10
> ```
> Interestingly the performance seems to degrade exponentially with the nesting
> level of the lists:
> ```r
> # add a level of nesting
> dummy2 <- tibble(
> points = rep(list(list(seq(6))), 2e6),
> index = seq(2e6)
> )
> # order of magnitude slower again, lost patience waiting for it to return
> system.time(write_parquet(dummy2, "dummy2.parquet"))
> ```
> This has implications for {sf} data frames, which use list columns to
> represent spatial data structures. Arrow/parquet are pretty much not viable
> for moderate to large spatial data in R:
> ```r
> # options(timeout = 1000)
> remotes::install_github("wfmackey/absmapsdata")
> library(absmapsdata)
> # doesn't return in a reasonable amount of time
> write_arrow(absmapsdata::sa12016, "sa1.parquet")
> # can use the same workaround as above by converting geometry to a vector of
> # well-known text, but it takes time and bloats the files
> ```
> Possibly related to https://issues.apache.org/jira/browse/ARROW-12529 ?
>