[ 
https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517418#comment-16517418
 ] 

Boaz Ben-Zvi commented on DRILL-6453:
-------------------------------------

Thanks [~khfaraaz] for getting all the information. The log clearly shows 
that several Hash-Join operators in the query did not have enough memory. 
The plan must be highly parallel, with multiple "buffered" operators; hence 
when the 10GB of memory (per query per node) was divided among all of them, the 
result (memory for this instance of Hash Join) was +*very small*+! The 
40MB shown above is even bigger than that result, as we "artificially 
bump up" the result to 40MB (see the internal option 
"planner.memory.min_memory_per_buffered_op").

  Even with this bump up, the memory was not sufficient (a minimum of 76MB 
was needed). So the Hash-Join operator went into "fallback mode", that is, it 
ignored any memory constraints (just like in 1.13 and before). This fallback is 
also controlled by an option ("drill.exec.hashjoin.fallback.enabled"), which 
currently defaults to true, but we want to change the default to false soon. 
This change would cause the above query to fail with a detailed message 
(suggesting more memory is needed).
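
To make the arithmetic above concrete, here is a minimal sketch of the 
reasoning, not Drill's actual code. The function names and the count of 400 
buffered operator instances are hypothetical, chosen only for illustration; 
the 10GB, 40MB and 76MB figures are the ones from this issue.

```python
MB = 1024 * 1024

def memory_per_buffered_op(max_query_memory_per_node, num_buffered_ops,
                           min_memory_per_buffered_op=40 * MB):
    """Split the per-node query memory evenly among the buffered operators,
    then bump the share up to the configured minimum if it is smaller."""
    share = max_query_memory_per_node // num_buffered_ops
    return max(share, min_memory_per_buffered_op)

def hash_join_mode(granted, needed, fallback_enabled):
    """Decide what a Hash-Join does with the memory it was granted."""
    if granted >= needed:
        return "memory-limited"   # enough memory: honor the limit
    if fallback_enabled:
        return "fallback"         # ignore the limit, as in 1.13 and before
    return "fail"                 # report that more memory is needed

# 10GB per query per node, as in planner.memory.max_query_memory_per_node.
per_node = 10737418240
# Hypothetical count of buffered operator instances on one node.
granted = memory_per_buffered_op(per_node, 400)

print(granted // MB)                               # share after the bump up, in MB
print(hash_join_mode(granted, 76 * MB, True))      # fallback currently enabled
print(hash_join_mode(granted, 76 * MB, False))     # with the planned default
```

Under these assumptions, any plan with more than 256 buffered operator 
instances per node yields an even share below 40MB, so the bump up applies 
and the 76MB requirement still cannot be met.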

The Hash-Aggregate has a similar option (default: false), which the settings 
above show was set to true. We suggest customers not go this way, but rather 
allocate more memory.

So what went wrong: so many operators used "fallback" and allocated more 
memory than planned, possibly beyond 10GB in total. If the total reached 
12GB, the JVM would have hit an OOM. This looks very much like DRILL-6468 
("OOMs trigger graceful shutdown when terminating Drill. This can cause a 
hang."), whose PR ( #1306 ) was just committed into the master branch. 
[~khfaraaz], can you test again with this new PR included?

 

> TPC-DS query 72 has regressed
> -----------------------------
>
>                 Key: DRILL-6453
>                 URL: https://issues.apache.org/jira/browse/DRILL-6453
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.14.0
>            Reporter: Khurram Faraaz
>            Assignee: Boaz Ben-Zvi
>            Priority: Blocker
>             Fix For: 1.14.0
>
>         Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill
>
>
> TPC-DS query 72 seems to have regressed, query profile for the case where it 
> Canceled after 2 hours on Drill 1.14.0 is attached here.
> {noformat}
> On, Drill 1.14.0-SNAPSHOT 
> commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took 
> around 55 seconds to execute)
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> TPC-DS query 72 executed successfully & took 47 seconds to complete execution.
> {noformat}
> {noformat}
> TPC-DS data in the below run has date values stored as DATE datatype and not 
> VARCHAR type
> On, Drill 1.14.0-SNAPSHOT
> commit : 82e1a12
> SF1 parquet data on 4 nodes; 
> planner.memory.max_query_memory_per_node = 10737418240. 
> drill.exec.hashagg.fallback.enabled = true
> and
> alter system set `exec.hashjoin.num_partitions` = 1;
> TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete, I had to 
> Cancel it by stopping the Foreman drillbit.
> As a result several minor fragments are reported to be in 
> CANCELLATION_REQUESTED state on UI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
