[
https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517418#comment-16517418
]
Boaz Ben-Zvi commented on DRILL-6453:
-------------------------------------
Thanks [~khfaraaz] for gathering all the information. The log clearly shows
that several Hash-Join operators in the query did not have enough memory.
The plan must be highly parallel, with multiple "buffered" operators; hence
when the 10GB of memory (per query per node) was divided among all of them, the
share for this instance of Hash Join was +*very small*+. The
40MB shown above is actually larger than that share, because we "artificially
bump up" the share to 40MB (see the internal option
"planner.memory.min_memory_per_buffered_op").
Even with this bump up, the memory was not sufficient (a minimum of 76MB was
needed). So the Hash-Join operator went into "fallback mode", that is, it
ignored any memory constraints (just like in 1.13 and before). This fallback is
also controlled by an option ("drill.exec.hashjoin.fallback.enabled"), which
currently defaults to true, but we want to change the default to false soon.
That change would cause the above query to fail with a detailed message
(suggesting more memory is needed).
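To make the memory arithmetic above concrete, here is a minimal sketch. It assumes a simple even split of the per-node query budget across buffered operator instances, which is a simplification and not necessarily how the planner divides memory. The function name and the count of 400 instances are hypothetical; the 10GB budget, the 40MB floor and the 76MB requirement come from this issue.

```python
MIB = 1024 * 1024

def memory_per_buffered_op(max_query_memory_per_node, buffered_op_instances,
                           min_memory_per_buffered_op=40 * MIB):
    """Return the budget for one buffered operator instance, in bytes.

    Assumption: the node budget is split evenly, then raised to the floor
    (modelled on planner.memory.min_memory_per_buffered_op).
    """
    even_share = max_query_memory_per_node // buffered_op_instances
    return max(even_share, min_memory_per_buffered_op)

budget = 10737418240          # planner.memory.max_query_memory_per_node from the issue
instances = 400               # hypothetical number of buffered operator instances
required = 76 * MIB           # minimum the Hash-Join needed, per the log

granted = memory_per_buffered_op(budget, instances)
print("granted MiB:", granted // MIB)
print("enough for Hash-Join:", granted >= required)
```

With these illustrative numbers the even share falls below the floor, so the operator is granted 40MB, which is still less than the 76MB it needs.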
The Hash-Aggregate also has a similar option (default: false), which the
settings above show was set to true. We suggest customers not go this way, but
rather allocate more memory.
So what went wrong: many operators used "fallback" and allocated more
memory than planned, possibly beyond 10GB in total. If the total reached
12GB, the JVM would have thrown an OOM. This looks very much like
DRILL-6468 ("OOMs trigger graceful shutdown when terminating Drill. This can
cause a hang."), whose PR (#1306) was just committed into the master branch.
[~khfaraaz], can you test again with this new PR included?
> TPC-DS query 72 has regressed
> -----------------------------
>
> Key: DRILL-6453
> URL: https://issues.apache.org/jira/browse/DRILL-6453
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.14.0
> Reporter: Khurram Faraaz
> Assignee: Boaz Ben-Zvi
> Priority: Blocker
> Fix For: 1.14.0
>
> Attachments: 24f75b18-014a-fb58-21d2-baeab5c3352c.sys.drill
>
>
> TPC-DS query 72 seems to have regressed. The query profile for the run that
> was canceled after 2 hours on Drill 1.14.0 is attached here.
> {noformat}
> On, Drill 1.14.0-SNAPSHOT
> commit : 931b43e (TPC-DS query 72 executed successfully on this commit, took
> around 55 seconds to execute)
> SF1 parquet data on 4 nodes;
> planner.memory.max_query_memory_per_node = 10737418240.
> drill.exec.hashagg.fallback.enabled = true
> TPC-DS query 72 executed successfully & took 47 seconds to complete execution.
> {noformat}
> {noformat}
> TPC-DS data in the below run has date values stored as DATE datatype and not
> VARCHAR type
> On, Drill 1.14.0-SNAPSHOT
> commit : 82e1a12
> SF1 parquet data on 4 nodes;
> planner.memory.max_query_memory_per_node = 10737418240.
> drill.exec.hashagg.fallback.enabled = true
> and
> alter system set `exec.hashjoin.num_partitions` = 1;
> TPC-DS query 72 executed for 2 hrs and 11 mins and did not complete, I had to
> Cancel it by stopping the Foreman drillbit.
> As a result several minor fragments are reported to be in
> CANCELLATION_REQUESTED state on UI.
> {noformat}