paleolimbot commented on issue #11665:
URL: https://github.com/apache/arrow/issues/11665#issuecomment-1205203653
I can confirm that there's no batching happening (1) when converting the
data.frame to a Table or (2) when converting an R vector to a ChunkedArray: no
matter how big the data frame is, it will (by my reading of the code) always be
converted in full to a Table whose chunked arrays each consist of a single
chunk before being written as Feather.
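A quick way to observe this (the conversion via `arrow_table()` and the column picked here are just illustrative):
``` r
library(arrow, warn.conflicts = FALSE)

# Each column of the converted Table is a ChunkedArray with exactly one
# chunk, no matter how many rows the data frame has.
df <- vctrs::vec_rep(ggplot2::mpg, 1e4)
tab <- arrow_table(df)
tab[["manufacturer"]]$num_chunks
#> [1] 1
```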
There is an open Jira (ARROW-15405) to allow `write_ipc_stream()`,
`write_feather()`, and `write_parquet()` to accept a `RecordBatchReader`
(`write_csv_arrow()` already does, thanks to Nic!). In combination with a
chunking `as_record_batch_reader()` method for `data.frame`, that would almost
certainly solve the hang-on-write issue.
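Until then, a rough sketch of what chunked writing can look like by hand: slice the data frame into row ranges and stream each slice through a `RecordBatchFileWriter` (Feather V2 is the Arrow IPC file format, so `read_feather()` can read the result). The helper name and default chunk size are made up for illustration:
``` r
library(arrow, warn.conflicts = FALSE)

# Hypothetical helper: write a data.frame as Feather in row chunks so
# that only one slice at a time is materialized as Arrow memory.
write_feather_chunked <- function(df, path, chunk_size = 65536L) {
  schema <- record_batch(df[0, , drop = FALSE])$schema
  sink <- FileOutputStream$create(path)
  writer <- RecordBatchFileWriter$create(sink, schema)
  on.exit({
    writer$close()
    sink$close()
  })
  for (start in seq(1L, nrow(df), by = chunk_size)) {
    end <- min(start + chunk_size - 1L, nrow(df))
    writer$write(record_batch(df[start:end, , drop = FALSE]))
  }
  invisible(path)
}
```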
It sounds like the root cause of the hang, though, is that something about
the data frame to Feather write operation uses far more memory in some cases
than anybody can explain. It would be helpful to have a minimal reproducer for
that which runs in a reasonable amount of time. I know how to profile R memory
usage (`bench::mark()` will do it using the `profmem` package), but I don't
have a strategy for profiling allocations that happen outside R beyond
inspecting the default memory pool.
Something that crossed my mind as I was writing the reprex below is that R
has a global string pool, which means that `c("string one", "string one",
"string one")` stores `"string one"` only once. If we expand that into an Arrow
`string()` array all at once, we copy `"string one"` many times and can use
far more memory than `lobstr::obj_size()` would suggest.
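A rough way to see that effect (a sketch, assuming the conversion allocates from the default memory pool; the vector size is arbitrary and no exact numbers are claimed):
``` r
library(arrow, warn.conflicts = FALSE)

x <- rep("string one", 1e6)
# R stores the "string one" CHARSXP once, so obj_size() mostly counts
# a vector of pointers to the same pooled string
lobstr::obj_size(x)

pool <- default_memory_pool()
before <- pool$bytes_allocated
arr <- Array$create(x)
# Arrow materializes every copy of "string one" (plus int32 offsets),
# so the pool should grow by much more than obj_size() reported
pool$bytes_allocated - before
```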
Perhaps a starting place:
``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
big_df <- vctrs::vec_rep(ggplot2::mpg, 1e4)
lobstr::obj_size(big_df)
#> 168.49 MB
tf <- tempfile()
bench::as_bench_bytes(default_memory_pool()$max_memory)
#> [1] 0B
bench::mark(write_feather(big_df, tf))
#> # A tibble: 1 × 6
#>   expression                     min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 write_feather(big_df, tf)    224ms    230ms      4.37    3.02MB        0
bench::as_bench_bytes(default_memory_pool()$max_memory)
#> [1] 392MB
bench::as_bench_bytes(file.size(tf))
#> [1] 56.5MB
```
<sup>Created on 2022-08-04 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.1)</sup>