paleolimbot commented on issue #11665:
URL: https://github.com/apache/arrow/issues/11665#issuecomment-1205203653
I can confirm that there's no batching happening (1) when converting the
data.frame to a Table or (2) when converting an R vector to a ChunkedArray: no
matter how big the data frame is, it will (by my reading of the code) always be
converted in full to a Table whose chunked arrays each consist of a single
chunk before being written as Feather.
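A quick way to observe this (the conversion via `arrow_table()` and the column picked here are just illustrative):
``` r
library(arrow, warn.conflicts = FALSE)

# Each column of the converted Table is a ChunkedArray with exactly one
# chunk, no matter how many rows the data frame has.
df <- vctrs::vec_rep(ggplot2::mpg, 1e4)
tab <- arrow_table(df)
tab[["manufacturer"]]$num_chunks
#> [1] 1
```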
There is an open Jira (ARROW-15405) to allow `write_ipc_stream()`,
`write_feather()`, and `write_parquet()` to accept a `RecordBatchReader`
(`write_csv_arrow()` already does, thanks to Nic!). In combination with a
chunking `as_record_batch_reader()` method for `data.frame`, that would almost
certainly solve the hang-on-write issue.
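Until then, a rough sketch of what chunked writing can look like by hand: slice the data frame into row ranges and stream each slice through a `RecordBatchFileWriter` (Feather V2 is the Arrow IPC file format, so `read_feather()` can read the result). The helper name and default chunk size are made up for illustration:
``` r
library(arrow, warn.conflicts = FALSE)

# Hypothetical helper: write a data.frame as Feather in row chunks so
# that only one slice at a time is materialized as Arrow memory.
write_feather_chunked <- function(df, path, chunk_size = 65536L) {
  schema <- record_batch(df[0, , drop = FALSE])$schema
  sink <- FileOutputStream$create(path)
  writer <- RecordBatchFileWriter$create(sink, schema)
  on.exit({
    writer$close()
    sink$close()
  })
  for (start in seq(1L, nrow(df), by = chunk_size)) {
    end <- min(start + chunk_size - 1L, nrow(df))
    writer$write(record_batch(df[start:end, , drop = FALSE]))
  }
  invisible(path)
}
```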
It sounds like the root cause of the hang, though, is that something about
the data frame to Feather write operation uses far more memory in some cases
than anybody can explain. It would be helpful to have a minimal reproducer for
that which runs in a reasonable amount of time. I know how to profile R memory
usage (`bench::mark()` will do it using the `profmem` package), but I don't
have a strategy for profiling allocations that happen outside R beyond
inspecting the default memory pool.
Something that crossed my mind as I was writing the reprex below is that R
has a global string pool, which means that `c("string one", "string one",
"string one")` stores `"string one"` only once. If we expand that into an Arrow
`string()` array all at once, we copy `"string one"` many times and can use
far more memory than `lobstr::obj_size()` would suggest.
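A rough way to see that effect (a sketch, assuming the conversion allocates from the default memory pool; the vector size is arbitrary and no exact numbers are claimed):
``` r
library(arrow, warn.conflicts = FALSE)

x <- rep("string one", 1e6)
# R stores the "string one" CHARSXP once, so obj_size() mostly counts
# a vector of pointers to the same pooled string
lobstr::obj_size(x)

pool <- default_memory_pool()
before <- pool$bytes_allocated
arr <- Array$create(x)
# Arrow materializes every copy of "string one" (plus int32 offsets),
# so the pool should grow by much more than obj_size() reported
pool$bytes_allocated - before
```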
Perhaps a starting place:
``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
big_df <- vctrs::vec_rep(ggplot2::mpg, 1e4)
lobstr::obj_size(big_df)
#> 168.49 MB
tf <- tempfile()
bench::as_bench_bytes(default_memory_pool()$max_memory)
#> [1] 0B
bench::mark(write_feather(big_df, tf))
#> # A tibble: 1 × 6
#>   expression                     min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 write_feather(big_df, tf)    224ms    230ms      4.37    3.02MB        0
bench::as_bench_bytes(default_memory_pool()$max_memory)
#> [1] 392MB
bench::as_bench_bytes(file.size(tf))
#> [1] 56.5MB
```
<sup>Created on 2022-08-04 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.1)</sup>