[
https://issues.apache.org/jira/browse/IMPALA-12233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Smith resolved IMPALA-12233.
------------------------------------
Fix Version/s: Impala 4.3.0
Resolution: Fixed
> Partitioned hash join with a limit can hang when using mt_dop>0
> ---------------------------------------------------------------
>
> Key: IMPALA-12233
> URL: https://issues.apache.org/jira/browse/IMPALA-12233
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.3.0
> Reporter: Joe McDonnell
> Assignee: Gergely Fürnstáhl
> Priority: Blocker
> Fix For: Impala 4.3.0
>
>
> After encountering a hung query on an Impala cluster, we were able to
> reproduce it in the Impala developer environment with these steps:
> {noformat}
> use tpcds;
> set mt_dop=2;
> select ss_cdemo_sk from store_sales where ss_sold_date_sk = (select
> max(ss_sold_date_sk) from store_sales) group by ss_cdemo_sk limit 1;{noformat}
> The problem reproduces with limit values up to 183, then at limit 184 and
> higher it doesn't reproduce.
> Taking stack traces show a thread waiting for a cyclic barrier:
> {noformat}
> 0 libpthread.so.0!__pthread_cond_wait + 0x216
> 1
> impalad!impala::CyclicBarrier::Wait<impala::PhjBuilder::DoneProbingHashPartitions(const
> int64_t*, impala::BufferPool::ClientHandle*, impala::RuntimeProfile*,
> std::deque<std::unique_ptr<impala::PhjBuilderPartition> >*,
> impala::RowBatch*)::<lambda()> > [condition-variable.h : 49 + 0xc]
> 2 impalad!impala::PhjBuilder::DoneProbingHashPartitions(long const*,
> impala::BufferPool::ClientHandle*, impala::RuntimeProfile*,
> std::deque<std::unique_ptr<impala::PhjBuilderPartition,
> std::default_delete<impala::PhjBuilderPartition> >,
> std::allocator<std::unique_ptr<impala::PhjBuilderPartition,
> std::default_delete<impala::PhjBuilderPartition> > > >*, impala::RowBatch*)
> [partitioned-hash-join-builder.cc : 766 + 0x25]
> 3
> impalad!impala::PartitionedHashJoinNode::DoneProbing(impala::RuntimeState*,
> impala::RowBatch*) [partitioned-hash-join-node.cc : 1189 + 0x28]
> 4 impalad!impala::PartitionedHashJoinNode::GetNext(impala::RuntimeState*,
> impala::RowBatch*, bool*) [partitioned-hash-join-node.cc : 599 + 0x15]
> 5
> impalad!impala::StreamingAggregationNode::GetRowsStreaming(impala::RuntimeState*,
> impala::RowBatch*) [streaming-aggregation-node.cc : 115 + 0x14]
> 6 impalad!impala::StreamingAggregationNode::GetNext(impala::RuntimeState*,
> impala::RowBatch*, bool*) [streaming-aggregation-node.cc : 77 + 0x15]
> 7 impalad!impala::FragmentInstanceState::ExecInternal()
> [fragment-instance-state.cc : 446 + 0x15]
> 8 impalad!impala::FragmentInstanceState::Exec() [fragment-instance-state.cc
> : 104 + 0xf]
> 9 impalad!impala::QueryState::ExecFInstance(impala::FragmentInstanceState*)
> [query-state.cc : 956 + 0xf]{noformat}
> Adding some debug logging around locations that go through that cyclic
> barrier, we see one Impalad where it is expecting two threads and only one
> arrives:
> {noformat}
> I0621 18:28:19.926551 210363 partitioned-hash-join-builder.cc:766]
> 2a4787b28425372d:ac6bd96200000004] DoneProbingHashPartitions:
> num_probe_threads_=2
> I0621 18:28:19.927855 210362 streaming-aggregation-node.cc:136]
> 2a4787b28425372d:ac6bd96200000003] the number of rows (93) returned from the
> streaming aggregation node has exceeded the limit of 1
> I0621 18:28:19.928887 210362 query-state.cc:958]
> 2a4787b28425372d:ac6bd96200000003] Instance completed.
> instance_id=2a4787b28425372d:ac6bd96200000003 #in-flight=4 status=OK{noformat}
> Other instances that don't have a stuck thread see both threads arrive:
> {noformat}
> I0621 18:28:19.926223 210358 partitioned-hash-join-builder.cc:766]
> 2a4787b28425372d:ac6bd96200000005] DoneProbingHashPartitions:
> num_probe_threads_=2
> I0621 18:28:19.926326 210359 partitioned-hash-join-builder.cc:766]
> 2a4787b28425372d:ac6bd96200000006] DoneProbingHashPartitions:
> num_probe_threads_=2{noformat}
> So, there must be a codepath that skips going through the cyclic barrier.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)