[
https://issues.apache.org/jira/browse/DRILL-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529839#comment-14529839
]
Steven Phillips commented on DRILL-2936:
----------------------------------------
It turns out this is caused by a kind of deadlock that can arise with the
hash-to-merge exchange. The hash-to-merge exchange consists of a partition
sender and a merging receiver. Each partition sender has one outgoing bucket
for each of the downstream minor fragments, and each merging receiver has an
incoming buffer for each of the sending minor fragments.
The merging receiver cannot proceed without data from each of the sending
fragments. If data from any one of the sending fragments is unavailable, it
will block until it receives some data from that fragment, or a message
indicating there is no more data from that fragment.
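To make the blocking rule concrete, here is a minimal sketch of a merging
receiver in Python. It is not Drill's implementation: the class name, the
method names, and the BLOCKED sentinel are all hypothetical, and single
integer keys stand in for record batches.

```python
from collections import deque

BLOCKED = object()  # hypothetical sentinel: "cannot make progress yet"

class MergingReceiver:
    """Toy merging receiver with one incoming buffer per sending fragment."""

    def __init__(self, num_senders):
        self.buffers = [deque() for _ in range(num_senders)]
        self.finished = [False] * num_senders

    def receive(self, sender, key):
        self.buffers[sender].append(key)

    def end_of_stream(self, sender):
        self.finished[sender] = True

    def next(self):
        # Every sender must have buffered data or have signalled end-of-stream;
        # otherwise the smallest remaining key is not yet known.
        if not all(buf or done for buf, done in zip(self.buffers, self.finished)):
            return BLOCKED
        heads = [(buf[0], i) for i, buf in enumerate(self.buffers) if buf]
        if not heads:
            return None  # all inputs are finished and drained
        _, i = min(heads)
        return self.buffers[i].popleft()

receiver = MergingReceiver(num_senders=2)
receiver.receive(0, 5)
print(receiver.next() is BLOCKED)  # True: nothing has arrived from sender 1 yet
receiver.end_of_stream(1)
print(receiver.next())             # 5
```

The point of the sketch is that data from sender 0 sits unconsumed for as
long as sender 1 stays silent.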
If there is some skew in the data, it's possible that a partition sender may
not send any data to a particular receiver. That receiver will end up blocking
because it is waiting to receive that data. Since it is blocked, it is unable
to consume the data that it does receive from other senders. After a few
batches, those senders also block due to backpressure, because the receiver
is unable to consume what they have sent.
Once we reach this state, the query hangs indefinitely.
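The hang can be reproduced in a toy model. The sketch below is a simplified
simulation, not Drill code: the function name run_exchange, the round-based
scheduling, the one-key-per-batch simplification, and the capacity parameter
(the number of batches a receiver buffers per sender before backpressure
kicks in) are all assumptions made for illustration.

```python
from collections import deque

def run_exchange(sender_rows, num_receivers, capacity):
    """sender_rows[s] lists the (receiver, key) pairs sender s emits, in order.

    Returns True if all data is sent and consumed, False if the exchange stalls.
    """
    num_senders = len(sender_rows)
    pending = [deque(rows) for rows in sender_rows]
    # buffers[r][s] holds batches from sender s waiting at receiver r.
    buffers = [[deque() for _ in range(num_senders)] for _ in range(num_receivers)]
    eos = [[False] * num_senders for _ in range(num_receivers)]
    sender_done = [False] * num_senders

    progress = True
    while progress:
        progress = False
        for s in range(num_senders):
            if sender_done[s]:
                continue
            if pending[s]:
                r, key = pending[s][0]
                if len(buffers[r][s]) < capacity:
                    buffers[r][s].append(key)
                    pending[s].popleft()
                    progress = True
                # Otherwise the sender is blocked by backpressure.
            else:
                # Out of data: tell every receiver there is no more to come.
                for r in range(num_receivers):
                    eos[r][s] = True
                sender_done[s] = True
                progress = True
        for r in range(num_receivers):
            # A receiver merges only when every sender has data or has finished.
            ready = all(buffers[r][s] or eos[r][s] for s in range(num_senders))
            heads = [(buffers[r][s][0], s) for s in range(num_senders) if buffers[r][s]]
            if ready and heads:
                _, s = min(heads)
                buffers[r][s].popleft()
                progress = True
    return all(sender_done) and not any(buf for row in buffers for buf in row)

# Fully skewed input: sender 0 only feeds receiver 0, sender 1 only receiver 1.
skewed = [[(0, k) for k in range(10)], [(1, k) for k in range(10)]]
print(run_exchange(skewed, num_receivers=2, capacity=3))  # False: the exchange stalls
```

In this model the same skewed input completes when capacity is large enough
to hold everything a sender produces, since each sender then reaches its
end-of-stream message before it is ever blocked.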
> TPCH 4 and 18 SF100 hangs when hash agg is turned off
> -----------------------------------------------------
>
> Key: DRILL-2936
> URL: https://issues.apache.org/jira/browse/DRILL-2936
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Reporter: Ramana Inukonda Nagaraj
> Assignee: Steven Phillips
> Priority: Critical
> Fix For: 1.0.0
>
> Attachments: Screen Shot 2015-05-01 at 2.40.36 PM.png
>
>
> sys options:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirTpch100P> alter system set
> `planner.memory.max_query_memory_per_node` = 29205777612;
> 0: jdbc:drill:schema=dfs.drillTestDirTpch100P> alter system set
> `planner.enable_hashjoin`=false;
> 0: jdbc:drill:schema=dfs.drillTestDirTpch100P> alter system set
> `planner.enable_hashagg`=false;
> {code}
> On executing TPCH 04, the query hangs. From the profiles page, it does not
> look like any fragments are making progress; the last progress timestamps
> were from some time back.
> Attached is the logical plan.