Vitalie Spinu created ARROW-15938:
-------------------------------------
Summary: [R][C++] Segfault in left join with empty right table
when filtered on partition
Key: ARROW-15938
URL: https://issues.apache.org/jira/browse/ARROW-15938
Project: Apache Arrow
Issue Type: Bug
Components: C++, Compute IR
Affects Versions: 7.0.1
Environment: ubuntu linux, R4.1.2
Reporter: Vitalie Spinu
When the right table in a left join is empty because a filter on a
partition column matches no partition, the join segfaults:
{code:java}
library(arrow)
library(dplyr)  # provides mutate(), filter(), select(), left_join(), collect() and %>%
library(glue)

df <- mutate(iris, id = runif(n()))
dir <- "./tmp/iris"
dir.create(glue("{dir}/group=a/"), recursive = TRUE, showWarnings = FALSE)
dir.create(glue("{dir}/group=b/"), recursive = TRUE, showWarnings = FALSE)
write_parquet(df, glue("{dir}/group=a/part1.parquet"))
write_parquet(df, glue("{dir}/group=b/part2.parquet"))

# Right table: the filter on the partition column matches no partition, so it is empty
db1 <- open_dataset(dir) %>%
  filter(group == "blabla")

# Left join against the empty right table
open_dataset(dir) %>%
  filter(group == "b") %>%
  select(id) %>%
  left_join(db1, by = "id") %>%
  collect()
{code}
{code:java}
==24063== Thread 7:
==24063== Invalid read of size 1
==24063== at 0x1FFE606D:
arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(long,
arrow::compute::ExecBatch*, arrow::compute::ExecBatch*,
arrow::compute::ExecBatch*, arrow::compute::ExecBatch*) (in
/home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063== by 0x1FFE68CC:
arrow::compute::HashJoinBasicImpl::ProbeBatch_OutputOne(unsigned long, long,
int const*, int const*) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063== by 0x1FFE84D5:
arrow::compute::HashJoinBasicImpl::ProbeBatch(unsigned long,
arrow::compute::ExecBatch const&) (in
/home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063== by 0x1FFE8CB4:
arrow::compute::HashJoinBasicImpl::InputReceived(unsigned long, int,
arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063== by 0x200011CF:
arrow::compute::HashJoinNode::InputReceived(arrow::compute::ExecNode*,
arrow::compute::ExecBatch) (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063== by 0x1FFB580E:
arrow::compute::MapNode::SubmitTask(std::function<arrow::Result<arrow::compute::ExecBatch>
(arrow::compute::ExecBatch)>,
arrow::compute::ExecBatch)::{lambda()#1}::operator()() const (in
/home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063== by 0x1FFB6444: arrow::internal::FnOnce<void
()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture
(arrow::Future<arrow::internal::Empty>,
arrow::compute::MapNode::SubmitTask(std::function<arrow::Result<arrow::compute::ExecBatch>
(arrow::compute::ExecBatch)>,
arrow::compute::ExecBatch)::{lambda()#2}::operator()() const::{lambda()#1})>
>::invoke() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063== by 0x1FE2B2A0:
std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}>
> >::_M_run() (in /home/vspinu/bin/arrow/lib/libarrow.so.800.0.0)
==24063== by 0x92844BF: ??? (in
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
==24063== by 0x6DD46DA: start_thread (pthread_create.c:463)
==24063== by 0x710D71E: clone (clone.S:95)
==24063== Address 0x10 is not stack'd, malloc'd or (recently) free'd
==24063== *** caught segfault ***
address 0x10, cause 'memory not mapped'

Traceback:
1: Table__from_RecordBatchReader(self)
2: tab$read_table()
3: do_exec_plan(x)
4: doTryCatch(return(expr), name, parentenv, handler)
5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
6: tryCatchList(expr, classes, parentenv, handlers)
7: tryCatch(tab <- do_exec_plan(x), error = function(e) {
handle_csv_read_error(e, x$.data$schema)})
8: collect.arrow_dplyr_query(.)
9: collect(.)
10: open_dataset(dir) %>% filter(group == "b") %>% select(id) %>%
left_join(db1, by = "id") %>% collect()

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
{code}
This is Arrow built from current master (ece0e23f1).
It's worth noting that if the right table is filtered on a non-partition
column instead, the problem does not occur.
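
For comparison, a minimal sketch of the non-partition case might look like the following. The original report does not say which column or condition was used, so the name db2 and the filter Sepal.Length < 0 are assumptions chosen only because they also produce an empty right table. The dataset layout is the same as in the reproduction above.

```r
library(arrow)
library(dplyr)
library(glue)

# Same partitioned dataset layout as in the reproduction
df <- mutate(iris, id = runif(n()))
dir <- "./tmp/iris"
dir.create(glue("{dir}/group=a/"), recursive = TRUE, showWarnings = FALSE)
dir.create(glue("{dir}/group=b/"), recursive = TRUE, showWarnings = FALSE)
write_parquet(df, glue("{dir}/group=a/part1.parquet"))
write_parquet(df, glue("{dir}/group=b/part2.parquet"))

# Assumption: filter on a data column stored in the files (not the partition
# column); no iris row has a negative Sepal.Length, so db2 is empty
db2 <- open_dataset(dir) %>%
  filter(Sepal.Length < 0)

# Same left join as in the reproduction, against the empty db2
res <- open_dataset(dir) %>%
  filter(group == "b") %>%
  select(id) %>%
  left_join(db2, by = "id") %>%
  collect()
```

If this variant completes while the partition-filtered one crashes, that would suggest the difference lies in how the empty right side is produced when whole partitions are pruned, rather than in empty inputs in general.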