[
https://issues.apache.org/jira/browse/ARROW-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weston Pace updated ARROW-14197:
--------------------------------
Attachment: gdb.log
> [C++] Hashjoin + datasets hanging
> ---------------------------------
>
> Key: ARROW-14197
> URL: https://issues.apache.org/jira/browse/ARROW-14197
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Jonathan Keane
> Priority: Critical
> Labels: query-engine
> Fix For: 6.0.0
>
> Attachments: gdb.log, sample-while-hung.out.txt
>
>
> I’m getting a hang on the TPC-H query 4 pretty reliably (though it’s not
> _every_ time). The query is:
> {code}
> l <- input_table("lineitem") %>%
> select(l_orderkey, l_commitdate, l_receiptdate) %>%
> filter(l_commitdate < l_receiptdate) %>%
> select(l_orderkey)
> o <- input_table("orders") %>%
> select(o_orderkey, o_orderdate, o_orderpriority) %>%
> # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01"
> #   + interval '3' month) %>%
> filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate <
> as.Date("1993-10-01")) %>%
> select(o_orderkey, o_orderpriority)
> # distinct after join, tested and indeed faster
> lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
> distinct() %>%
> select(o_orderpriority)
> aggr <- lo %>%
> group_by(o_orderpriority) %>%
> summarise(order_count = n()) %>%
> arrange(o_orderpriority) %>%
> collect()
> {code}
> Basically: filter lineitem, filter orders, join the two, take the distinct
> rows, then group_by, summarise, and arrange.
> This happens pretty reliably when {{input_table}} returns a dataset backed by
> parquet or feather files (e.g. {{input_table}} returns something like
> {{arrow::open_dataset("path/to/{filename}.feather", format = "feather")}}).
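> For illustration, a minimal sketch of a dataset-backed {{input_table}}. This
> is a hypothetical stand-in, not the actual arrowbench helper: the function
> signature, the {{data_dir}} argument, and the tiny sample table are all
> assumptions made so the snippet runs on its own.
> {code}
> library(arrow)
>
> # Hypothetical stand-in for input_table(): open "<data_dir>/<name>.<format>"
> # as an Arrow dataset instead of reading it into memory.
> input_table <- function(name, data_dir, format = "feather") {
>   path <- file.path(data_dir, paste0(name, ".", format))
>   arrow::open_dataset(path, format = format)
> }
>
> # Write a tiny made-up "orders" file so the example is self-contained.
> data_dir <- tempfile("tpch-")
> dir.create(data_dir)
> arrow::write_feather(
>   data.frame(
>     o_orderkey = 1:3,
>     o_orderpriority = c("1-URGENT", "2-HIGH", "3-MEDIUM")
>   ),
>   file.path(data_dir, "orders.feather")
> )
>
> # The result is a Dataset, so dplyr verbs on it are evaluated lazily.
> orders <- input_table("orders", data_dir)
> {code}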
> One can replicate this by installing an arrowbench branch
> (https://github.com/ursacomputing/arrowbench/pull/37) with, in R:
> {{remotes::install_github("ursacomputing/arrowbench@moar-tpch")}} and then
> running the following:
> {code}
> library(arrowbench)
> results <- run_benchmark(
> tpc_h,
> scale_factor = 1,
> cpu_count = 8,
> query_id = 4,
> # remove the lib_path line if you have a recent install of the arrow R
> # package that supports hash joins and want to avoid building a separate copy
> lib_path = "remote-apache/arrow@HEAD",
> format = "feather",
> n_iter = 20
> )
> {code}
> Note that this _sometimes_ finishes, but frequently it does not and gets stuck.
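> Because the hang is intermittent, it may help to run each attempt in a
> separate R process with a time limit, so a stuck run is killed and counted
> rather than blocking the session. A rough sketch, assuming the {{callr}}
> package is installed; {{finished_in_time}} is a hypothetical helper name, and
> the {{Sys.sleep}} calls are placeholders for the real query:
> {code}
> library(callr)
>
> # Hypothetical helper: run `fn` in a fresh R subprocess and report whether it
> # returned within `timeout_secs`. callr kills the subprocess on timeout and
> # signals an error. Note that any other error raised by `fn` also yields FALSE.
> finished_in_time <- function(fn, timeout_secs) {
>   tryCatch(
>     {
>       callr::r(fn, timeout = timeout_secs)
>       TRUE
>     },
>     error = function(e) FALSE
>   )
> }
>
> # Placeholder workloads standing in for the query: one quick, one "stuck".
> quick <- finished_in_time(function() Sys.sleep(0.1), timeout_secs = 30)
> stuck <- finished_in_time(function() Sys.sleep(60), timeout_secs = 2)
> {code}
> With the real query in place of the placeholders, repeating the call and
> tallying the FALSE results would give a rough hang rate.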
--
This message was sent by Atlassian Jira
(v8.3.4#803005)