[jira] [Commented] (ARROW-14020) [R] Writing datafames with list columns is slow and scales poorly with nesting level

Jonathan Keane (Jira) Fri, 17 Sep 2021 07:19:05 -0700


    [ https://issues.apache.org/jira/browse/ARROW-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416718#comment-17416718 ]


Jonathan Keane commented on ARROW-14020:
----------------------------------------

Thanks for the report! 

I did a bit of profiling, and I think I see where this is slowing down: when 
presented with a list column, {arrow} inspects the attributes of the column 
(and specifically of each element in it) in order to save them as metadata, 
even though these particular list columns have no additional attributes. [1] 
We've already disabled this metadata when interacting with datasets 
(ARROW-13189), and it's possible we should take out this saving entirely, 
though we would probably still need to provide an option for it, since people 
might depend on being able to save or read that metadata. Alternatively, we 
could streamline the process to make it faster.
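
To make the cost concrete, here is a rough sketch in base R of what a 
per-element attribute check looks like on a list column like the one in the 
report. This is not the actual {arrow} code: the variable names are made up for 
illustration, and the row count is scaled down to 1e5 so it finishes quickly.

```r
# Illustrative sketch only: this is NOT the {arrow} implementation, just the
# same kind of per-element attribute inspection done in plain base R.
n <- 1e5  # scaled down from the 2e6 rows in the report
points <- rep(list(seq(6)), n)

# One R-level call per list element, so the cost grows with the number of rows.
timing <- system.time(elem_attrs <- lapply(points, attributes))
print(timing)

# None of the elements carry attributes, so there is nothing worth saving.
nothing_to_save <- all(vapply(elem_attrs, is.null, logical(1)))
print(nothing_to_save)
```

The point of the sketch is only that the work is proportional to the number of 
elements even when every element turns out to have no attributes at all.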

On the downstream impacts on SF: we've done a little bit of exploration on 
making the experience of saving SF columns better (e.g. ARROW-12542), though we 
haven't gotten to something perfect just yet. For now, at least, if you're 
saving large amounts of SF data and running into issues like this, we would 
recommend checking out the [{sfarrow}|https://github.com/wcjochem/sfarrow] 
package. It is similar to the workaround you proposed, but uses well-known 
binary to encode the SF column(s) and comes with some utilities/helper 
functions. 

[1] - Profiling showed that 
https://github.com/apache/arrow/blob/b599a0539fffd1bb226ebce83e2f035d3080ac41/r/R/metadata.R#L160
 takes a large share of the time spent converting to a table, and inside of that 
https://github.com/apache/arrow/blob/b599a0539fffd1bb226ebce83e2f035d3080ac41/r/R/metadata.R#L129
 accounts for between a quarter and a half of that time.


> [R] Writing datafames with list columns is slow and scales poorly with 
> nesting level
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-14020
>                 URL: https://issues.apache.org/jira/browse/ARROW-14020
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 5.0.0
>         Environment: Windows 10 x64
>            Reporter: Miles McBain
>            Priority: Major
>
> Writing data frames that contain list columns seems much slower than expected:
> ``` r
>  library(tidyverse)
>  #> Warning: package 'tidyverse' was built under R version 4.1.1
>  #> Warning: package 'tibble' was built under R version 4.1.1
>  #> Warning: package 'readr' was built under R version 4.1.1
>  library(arrow)
>  #> Warning: package 'arrow' was built under R version 4.1.1
>  #>
>  #> Attaching package: 'arrow'
>  #> The following object is masked from 'package:utils':
>  #>
>  #> timestamp
>  dummy <- tibble(
>  points = rep(list(seq(6)), 2e6),
>  index = seq(2e6)
>  )
>  # very slooooooow
>  system.time(write_parquet(dummy, "dummy.parquet"))
>  #> user system elapsed
>  #> 55.64 0.11 55.98
> dummy_txt <- mutate(dummy, points = map_chr(points, deparse))
>  # orders of magnitude faster
>  system.time(write_parquet(dummy_txt, "dummytext.parquet"))
>  #> user system elapsed
>  #> 0.24 0.02 0.25
>  ```
> <sup>Created on 2021-09-17 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
> <details style="margin-bottom:10px;">
> <summary>Session info</summary>
> ``` r
>  sessioninfo::session_info()
>  #> - Session info ---------------------------------------------------------------
>  #> setting value
>  #> version R version 4.1.0 (2021-05-18)
>  #> os Windows 10 x64
>  #> system x86_64, mingw32
>  #> ui RTerm
>  #> language (EN)
>  #> collate English_Australia.1252
>  #> ctype English_Australia.1252
>  #> tz Australia/Brisbane
>  #> date 2021-09-17
>  #>
>  #> - Packages -------------------------------------------------------------------
>  #> package * version date lib source
>  #> arrow * 5.0.0.2 2021-09-05 [1] CRAN (R 4.1.1)
>  #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
>  #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
>  #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
>  #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
>  #> broom 0.7.7 2021-06-13 [1] CRAN (R 4.1.0)
>  #> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
>  #> cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
>  #> colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
>  #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
>  #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
>  #> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
>  #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
>  #> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
>  #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
>  #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
>  #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
>  #> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
>  #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
>  #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
>  #> ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
>  #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
>  #> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
>  #> haven 2.4.1 2021-04-23 [1] CRAN (R 4.1.0)
>  #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
>  #> hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
>  #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
>  #> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
>  #> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
>  #> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
>  #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
>  #> lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
>  #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
>  #> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
>  #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
>  #> pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0)
>  #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
>  #> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
>  #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
>  #> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
>  #> readr * 2.0.1 2021-08-10 [1] CRAN (R 4.1.1)
>  #> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
>  #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
>  #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
>  #> rmarkdown 2.9 2021-06-15 [1] CRAN (R 4.1.0)
>  #> rvest 1.0.1 2021-07-26 [1] CRAN (R 4.1.0)
>  #> scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
>  #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
>  #> stringi 1.7.4 2021-08-25 [1] CRAN (R 4.1.1)
>  #> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
>  #> styler 1.4.1 2021-03-30 [1] CRAN (R 4.1.0)
>  #> tibble * 3.1.4 2021-08-25 [1] CRAN (R 4.1.1)
>  #> tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.1.0)
>  #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
>  #> tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.1)
>  #> tzdb 0.1.2 2021-07-20 [1] CRAN (R 4.1.0)
>  #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
>  #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
>  #> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
>  #> xfun 0.24 2021-06-15 [1] CRAN (R 4.1.0)
>  #> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
>  #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
>  #>
>  #> [1] C:/Users/msmcbain/libs/R
>  #> [2] C:/R/R-4.1.0/library
>  ```
> </details>
> In this case it's actually faster to convert the list columns to text and do 
> the write, than to write with the list columns. 
> This issue also affects write_arrow:
> ``` r
>  library(tidyverse)
>  #> Warning: package 'tidyverse' was built under R version 4.1.1
>  #> Warning: package 'tibble' was built under R version 4.1.1
>  #> Warning: package 'readr' was built under R version 4.1.1
>  library(arrow)
>  #> Warning: package 'arrow' was built under R version 4.1.1
>  #>
>  #> Attaching package: 'arrow'
>  #> The following object is masked from 'package:utils':
>  #>
>  #> timestamp
>  dummy <- tibble(
>  points = rep(list(seq(6)), 2e6),
>  index = seq(2e6)
>  )
>  # very slooooooow
>  system.time(write_arrow(dummy, "dummy.parquet"))
>  #> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
>  #> user system elapsed
>  #> 56.95 0.08 57.13
> dummy_txt <- mutate(dummy, points = map_chr(points, deparse))
>  # orders of magnitude faster
>  system.time(write_arrow(dummy_txt, "dummytext.parquet"))
>  #> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
>  #> user system elapsed
>  #> 0.06 0.01 0.10
>  ```
> <sup>Created on 2021-09-17 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
> Interestingly the performance seems to degrade exponentially with the nesting 
> level of the lists:
> ```r
> # add a level of nesting
>  dummy2 <- tibble(
>    points = rep(list(list(seq(6))), 2e6),
>    index = seq(2e6)
>  )
> # an order of magnitude slower again; lost patience waiting for it to return
>  system.time(write_parquet(dummy2, "dummy2.parquet"))
>  ```
> This has implications for {sf} data frames, which use list columns to 
> represent spatial data structures. Arrow/parquet are pretty much not viable 
> for moderate to large spatial data in R:
> ```r
>  # options(timeout = 1000)
> remotes::install_github("wfmackey/absmapsdata")
>  library(absmapsdata)
>  # doesn't return in a reasonable amount of time
>  write_arrow(absmapsdata::sa12016, "sa1.parquet")
>  # can use the same workaround as above by converting geometry to a vector of
>  # well-known text, but it takes time and bloats the files
>  ```
> Possibly related to https://issues.apache.org/jira/browse/ARROW-12529 ?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
