[ 
https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16543983#comment-16543983
 ] 

Vlad Rozov commented on DRILL-6453:
-----------------------------------

A deadlock when an operator such as hash join switches between reading from 
left and right sides is caused by:

- Drill senders can send only one batch at a time. For senders such as 
broadcast or hash partitioner, it means that if one of the receivers did not 
acknowledge 3 batches, the sender won't be able to send to any of it's 
receivers and would block until the receiver sends an acknowledgment (for 
previously sent batches).
- On the receiving side, if for example hash join flips between reading from 
left and right sides, it may lead to a condition where for one minor fragment, 
left side is empty while for another minor fragment the right side is empty.
- Drill does not allow to probe if receiver queue is empty, so the first 
fragment would block waiting for the left side to become not empty, while the 
second minor fragment would block on the same condition for the right side.
- As hash join reads from left and right sides on the same thread, when it 
blocks reading from left side, right side may become full and no more 
acknowledgments would be sent to the sender. The same for the second minor 
fragment with left and right flipped. And Drill is at the deadlock condition as 
neither receiver or sender can proceed.
 

> TPC-DS query 72 has regressed
> -----------------------------
>
>                 Key: DRILL-6453
>                 URL: https://issues.apache.org/jira/browse/DRILL-6453
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.14.0
>            Reporter: Khurram Faraaz
>            Assignee: Boaz Ben-Zvi
>            Priority: Blocker
>             Fix For: 1.14.0
>
>         Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill, 
> jstack_29173_June_10_2018.txt, jstack_29173_June_10_2018.txt, 
> jstack_29173_June_10_2018_b.txt, jstack_29173_June_10_2018_b.txt, 
> jstack_29173_June_10_2018_c.txt, jstack_29173_June_10_2018_c.txt, 
> jstack_29173_June_10_2018_d.txt, jstack_29173_June_10_2018_d.txt, 
> jstack_29173_June_10_2018_e.txt, jstack_29173_June_10_2018_e.txt
>
>
> TPC-DS query 72 seems to have regressed, query profile for the case where it 
> Canceled after 2 hours on Drill 1.14.0 is attached here.
> {noformat}
> On, Drill 1.14.0-SNAPSHOT 
> commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took 
> around 55 seconds to execute)
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> TPC-DS query 72 executed successfully & took 47 seconds to complete execution.
> {noformat}
> {noformat}
> TPC-DS data in the below run has date values stored as DATE datatype and not 
> VARCHAR type
> On, Drill 1.14.0-SNAPSHOT
> commit : 82e1a12
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> and
> alter system set `exec.hashjoin.num_partitions` = 1;
> TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete, I had to 
> Cancel it by stopping the Foreman drillbit.
> As a result several minor fragments are reported to be in 
> CANCELLATION_REQUESTED state on UI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to