paleolimbot commented on pull request #11730:
URL: https://github.com/apache/arrow/pull/11730#issuecomment-984921276


   I'm putting this down for today, but I've narrowed the segfault down to computing the exec plan. Specifically, it happens at this line in the R package:
   
   
   https://github.com/apache/arrow/blob/a8ed77ef1d517e29675465e4c623085d3eb29e7d/r/src/compute-exec.cpp#L92
   
   I haven't been able to get lldb working with R, so I don't know where this occurs in the Arrow library.
   
   I can also force a segfault by creating a record batch reader from an R function (without DuckDB), and I think the two are linked, since `read_table()` works for both. I wonder if the exec plan is calling the `array_stream->get_next()` method from multiple threads?
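
   If it is a threading issue, one quick (unverified) way to test might be to pin Arrow to a single thread before collecting, via `set_cpu_count()` and the `arrow.use_threads` option; the assumption in this sketch is that the crash disappears when nothing can call `get_next()` concurrently:

   ``` r
   library(arrow, warn.conflicts = FALSE)

   # sketch: force single-threaded execution in the Arrow R package; if the
   # segfault comes from concurrent get_next() calls, it should go away here
   arrow::set_cpu_count(1)
   options(arrow.use_threads = FALSE)

   # ...then build `query` as in the reprex below and try
   # result <- dplyr::collect(query)
   ```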
   
   <details>
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   example_data <- tibble::tibble(
     int = c(1:3, NA_integer_, 5:10),
     dbl = c(1:8, NA, 10) + .1,
     dbl2 = rep(5, 10),
     lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
     false = logical(10),
     chr = letters[c(1:5, NA, 7:10)],
     fct = factor(letters[c(1:4, NA, NA, 7:10)])
   )
   
   tf <- tempfile()
   new_ds <- rbind(
     cbind(example_data, part = 1),
     cbind(example_data, part = 2),
     cbind(example_data, part = 3),
     cbind(example_data, part = 4)
   ) %>%
     mutate(row_order = 1:n()) %>% 
     select(-false, -lgl, -fct)
   
   write_dataset(new_ds, tf, partitioning = "part")
   
   ds <- open_dataset(tf)
   
   # allocate an empty ArrowArrayStream with carrow and take its address,
   # so the scanner's RecordBatchReader can be exported into it below
   stream <- carrow:::blank_invalid_array_stream()
   stream_ptr <- carrow:::xptr_addr_double(stream)
   s <- Scanner$create(
     ds, 
     NULL,
     filter = TRUE,
     use_async = FALSE,
     use_threads = TRUE
   )$
     ToRecordBatchReader()$
     export_to_c(stream_ptr)
   
   
   # now, create an R stream based on a function that wraps the input stream;
   # basically, see if we can round-trip through R
   stream2 <- carrow::carrow_array_stream_function(ds$schema, function() {
     message("streeeeaming!")
     carrow::carrow_array_stream_get_next(stream)
   })
   
   # convert the carrow stream back into an arrow RecordBatchReader
   rbr <- carrow::carrow_array_stream_to_arrow(stream2)
   
   # schema OK
   rbr$schema
   #> Schema
   #> int: int32
   #> dbl: double
   #> dbl2: double
   #> chr: string
   #> row_order: int32
   #> part: int32
   #> 
   #> See $metadata for additional Schema metadata
   
   # query create OK
   query <- arrow:::as_adq(rbr) 
   
   # collect() is the only thing that segfaults
   # segfault is here:
   # https://github.com/apache/arrow/blob/master/r/src/compute-exec.cpp#L92
   # result <- dplyr::collect(query)
   
   # ...but a manual scan is OK, as well as read_table()
   # (which may explain why the streaming worked before)
   # rbr$read_next_batch()
   # rbr$read_next_batch()
   # rbr$read_next_batch()
   # rbr$read_next_batch()
   # rbr$read_next_batch()
   rbr$read_table()
   #> streeeeaming!
   #> streeeeaming!
   #> streeeeaming!
   #> streeeeaming!
   #> streeeeaming!
   #> Table
   #> 40 rows x 6 columns
   #> $int <int32>
   #> $dbl <double>
   #> $dbl2 <double>
   #> $chr <string>
   #> $row_order <int32>
   #> $part <int32>
   #> 
   #> See $metadata for additional Schema metadata
   ```
   
   <sup>Created on 2021-12-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   </details>
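
   In the meantime, a possible workaround sketch (assuming only the exec plan path is affected, as above) is to materialize the reader with `read_table()`, which doesn't crash, and collect from the resulting Table instead of streaming through the exec plan:

   ``` r
   # hypothetical workaround: read_table() works, so materialize first
   tbl <- rbr$read_table()
   result <- as.data.frame(tbl)  # or dplyr::collect(tbl)
   ```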
   

