paleolimbot commented on pull request #11730:
URL: https://github.com/apache/arrow/pull/11730#issuecomment-984921276

Putting this down for today, but I've narrowed the segfault down to computing the exec plan, specifically this line in the R package: https://github.com/apache/arrow/blob/a8ed77ef1d517e29675465e4c623085d3eb29e7d/r/src/compute-exec.cpp#L92

I haven't been able to get lldb working with R, so I don't know where this occurs in the Arrow library. I can also force a segfault by creating a record batch reader from an R function (without DuckDB involved), and I think the two crashes are linked, because `read_table()` works for both. I wonder if the exec plan is calling the `array_stream->get_next()` method from multiple threads?

<details>

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

example_data <- tibble::tibble(
  int = c(1:3, NA_integer_, 5:10),
  dbl = c(1:8, NA, 10) + .1,
  dbl2 = rep(5, 10),
  lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
  false = logical(10),
  chr = letters[c(1:5, NA, 7:10)],
  fct = factor(letters[c(1:4, NA, NA, 7:10)])
)

tf <- tempfile()
new_ds <- rbind(
  cbind(example_data, part = 1),
  cbind(example_data, part = 2),
  cbind(example_data, part = 3),
  cbind(example_data, part = 4)
) %>%
  mutate(row_order = 1:n()) %>%
  select(-false, -lgl, -fct)

write_dataset(new_ds, tf, partitioning = "part")
ds <- open_dataset(tf)

stream <- carrow:::blank_invalid_array_stream()
stream_ptr <- carrow:::xptr_addr_double(stream)

s <- Scanner$create(
  ds,
  NULL,
  filter = TRUE,
  use_async = FALSE,
  use_threads = TRUE
)$
  ToRecordBatchReader()$
  export_to_c(stream_ptr)

# now, create an R stream based on a function that wraps the input stream
# basically, see if we can roundtrip through R
stream2 <- carrow::carrow_array_stream_function(ds$schema, function() {
  message("streeeeaming!")
  carrow::carrow_array_stream_get_next(stream)
})

rbr <- carrow::carrow_array_stream_to_arrow(stream2)

# schema OK
rbr$schema
#> Schema
#> int: int32
#> dbl: double
#> dbl2: double
#> chr: string
#> row_order: int32
#> part: int32
#>
#> See $metadata for additional Schema metadata

# query create OK
query <- arrow:::as_adq(rbr)

# collect() is the only thing that segfaults
# segfault is here:
# https://github.com/apache/arrow/blob/master/r/src/compute-exec.cpp#L92
# result <- dplyr::collect(query)

# ...but a manual scan is OK, as well as read_table()
# (which may explain why the streaming worked before)
# rbr$read_next_batch()
# rbr$read_next_batch()
# rbr$read_next_batch()
# rbr$read_next_batch()
# rbr$read_next_batch()
rbr$read_table()
#> streeeeaming!
#> streeeeaming!
#> streeeeaming!
#> streeeeaming!
#> streeeeaming!
#> Table
#> 40 rows x 6 columns
#> $int <int32>
#> $dbl <double>
#> $dbl2 <double>
#> $chr <string>
#> $row_order <int32>
#> $part <int32>
#>
#> See $metadata for additional Schema metadata
```

<sup>Created on 2021-12-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

</details>
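One way to probe the multi-threaded `get_next()` hypothesis would be to wrap the exported stream so every callback is serialized, then see whether `collect()` still crashes. Below is a minimal C++ sketch of such a wrapper over the Arrow C stream interface (`arrow/c/abi.h`); `MutexedStream`, `WrapStreamWithMutex`, and the callback names are hypothetical and not existing Arrow or carrow APIs, and this is a diagnostic sketch, not a proposed fix:

``` cpp
// Sketch: serialize all access to an ArrowArrayStream with a mutex.
// If the exec plan no longer segfaults when consuming the wrapped
// stream, concurrent get_next() calls are the likely culprit. Note
// that if the callbacks re-enter R, calling them from a non-main
// thread could still crash even when the calls are serialized.
#include <mutex>

#include "arrow/c/abi.h"  // struct ArrowSchema / ArrowArray / ArrowArrayStream

struct MutexedStream {
  struct ArrowArrayStream inner;
  std::mutex mutex;
};

static int GetSchema(struct ArrowArrayStream* self, struct ArrowSchema* out) {
  auto* data = static_cast<MutexedStream*>(self->private_data);
  std::lock_guard<std::mutex> lock(data->mutex);
  return data->inner.get_schema(&data->inner, out);
}

static int GetNext(struct ArrowArrayStream* self, struct ArrowArray* out) {
  auto* data = static_cast<MutexedStream*>(self->private_data);
  std::lock_guard<std::mutex> lock(data->mutex);
  return data->inner.get_next(&data->inner, out);
}

static const char* GetLastError(struct ArrowArrayStream* self) {
  auto* data = static_cast<MutexedStream*>(self->private_data);
  return data->inner.get_last_error(&data->inner);
}

static void Release(struct ArrowArrayStream* self) {
  auto* data = static_cast<MutexedStream*>(self->private_data);
  data->inner.release(&data->inner);
  delete data;
  self->release = nullptr;  // mark the wrapper itself as released
}

// Move `in` into a wrapper whose callbacks hold a mutex, writing the
// wrapped stream to `out`.
void WrapStreamWithMutex(struct ArrowArrayStream* in,
                         struct ArrowArrayStream* out) {
  auto* data = new MutexedStream();
  data->inner = *in;
  in->release = nullptr;  // ownership has moved into the wrapper
  out->get_schema = &GetSchema;
  out->get_next = &GetNext;
  out->get_last_error = &GetLastError;
  out->release = &Release;
  out->private_data = data;
}
```

Dropping something like this between `export_to_c()` and the exec plan would at least distinguish "called concurrently" from "called off the main thread", which matters because an R-backed stream can only ever be safely invoked from the main R thread.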
Putting this down for today, but narrowed down the segfault to computing the exec plan. Specifically, this line in the R package: https://github.com/apache/arrow/blob/a8ed77ef1d517e29675465e4c623085d3eb29e7d/r/src/compute-exec.cpp#L92 I haven't been able to get lldb working with R, so I don't know where this occurs in the Arrow library. I can force a segfault by creating a record batch reader from an R function (without DuckDB) as well, and I think the two are linked (because `read_table()` seems to work for both). I wonder if the exec plan is calling the `array_stream->get_next()` method from multiple threads? <details> ``` r library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) example_data <- tibble::tibble( int = c(1:3, NA_integer_, 5:10), dbl = c(1:8, NA, 10) + .1, dbl2 = rep(5, 10), lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE), false = logical(10), chr = letters[c(1:5, NA, 7:10)], fct = factor(letters[c(1:4, NA, NA, 7:10)]) ) tf <- tempfile() new_ds <- rbind( cbind(example_data, part = 1), cbind(example_data, part = 2), cbind(example_data, part = 3), cbind(example_data, part = 4) ) %>% mutate(row_order = 1:n()) %>% select(-false, -lgl, -fct) write_dataset(new_ds, tf, partitioning = "part") ds <- open_dataset(tf) stream <- carrow:::blank_invalid_array_stream() stream_ptr <- carrow:::xptr_addr_double(stream) s <- Scanner$create( ds, NULL, filter = TRUE, use_async = FALSE, use_threads = TRUE )$ ToRecordBatchReader()$ export_to_c(stream_ptr) # now, create an R stream based on a function that wrap the input stream # basically, see if we can roundtrip through R stream2 <- carrow::carrow_array_stream_function(ds$schema, function() { message("streeeeaming!") carrow::carrow_array_stream_get_next(stream) }) rbr <- carrow::carrow_array_stream_to_arrow(stream2) # schema OK rbr$schema #> Schema #> int: int32 #> dbl: double #> dbl2: double #> chr: string #> row_order: int32 #> part: int32 #> #> See $metadata for additional Schema metadata # query create OK query <- arrow:::as_adq(rbr) # collect() is the only thing that segfaults # segfault is here: # https://github.com/apache/arrow/blob/master/r/src/compute-exec.cpp#L92 # result <- dplyr::collect(query) # ...but a manual scan is OK, as well as read_table() # (which may explain why the streaming worked before) # rbr$read_next_batch() # rbr$read_next_batch() # rbr$read_next_batch() # rbr$read_next_batch() # rbr$read_next_batch() rbr$read_table() #> streeeeaming! #> streeeeaming! #> streeeeaming! #> streeeeaming! #> streeeeaming! #> Table #> 40 rows x 6 columns #> $int <int32> #> $dbl <double> #> $dbl2 <double> #> $chr <string> #> $row_order <int32> #> $part <int32> #> #> See $metadata for additional Schema metadata ``` <sup>Created on 2021-12-02 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup> </details> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org