[
https://issues.apache.org/jira/browse/DRILL-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15878985#comment-15878985
]
Paul Rogers commented on DRILL-5289:
------------------------------------
In general, Java programs cannot gracefully handle heap exhaustion: any attempt
to do work requires creating objects, which cannot be done because... well...
the heap is exhausted.
A better solution is to manage heap resource usage: understand our usage, plan
for it, and inform the user of the heap needs. Since we don't understand our
heap usage, we may well be creating large objects on the heap unnecessarily, or
creating so many objects that the GC kicks in too frequently.
So, I'd reword this to not presuppose a solution. The solution is not to
exhaust memory, then deal with it. The solution is to manage memory so that we
don't exhaust the heap in the first place.
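As a minimal sketch of the "manage, don't react" idea: instead of catching OutOfMemoryError after the fact, a component can check projected heap usage against a headroom limit before committing to a large allocation. The class, method names, and the 90% threshold below are illustrative assumptions, not Drill code; only the java.lang.management API is real.

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical sketch (not Drill code): refuse work proactively when the
// projected heap footprint would exceed a safety fraction of the max heap,
// rather than letting the allocation trigger an OutOfMemoryError.
public class HeapHeadroomCheck {
    // Illustrative threshold: leave 10% of max heap as slack for the GC
    // and for the error-reporting path itself.
    private static final double HEAP_USE_LIMIT = 0.9;

    static boolean hasHeadroom(long bytesNeeded) {
        MemoryUsage heap =
            ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();           // -Xmx-derived limit (may be -1 if undefined)
        if (max < 0) return true;           // no limit reported; cannot check
        long projected = heap.getUsed() + bytesNeeded;
        return projected <= (long) (max * HEAP_USE_LIMIT);
    }

    public static void main(String[] args) {
        // A tiny allocation fits; an absurdly large one is rejected up front,
        // so the caller can fail the query cleanly with a real error message.
        System.out.println(hasHeadroom(1024));
        System.out.println(hasHeadroom(Long.MAX_VALUE / 2));
    }
}
{code}

The point of the check is that the rejection path runs while heap is still available, so Drill could still build an error message, release fragments, and notify ZooKeeper, which is exactly what becomes impossible once the heap is actually exhausted.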
> Drill should handle OOM due to insufficient heap type of errors more
> gracefully
> -------------------------------------------------------------------------------
>
> Key: DRILL-5289
> URL: https://issues.apache.org/jira/browse/DRILL-5289
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow, Execution - RPC
> Affects Versions: 1.10.0
> Reporter: Rahul Challapalli
> Attachments: jstack.txt, partial_log.txt, Screen Shot 2017-02-22 at
> 10.58.39 AM (2).png
>
>
> [Git Commit ID will be updated soon]
> The below query which uses the managed sort causes an OOM error due to
> insufficient heap, which is a bug in itself.
> {code}
> ALTER SESSION SET `exec.sort.disable_managed` = false;
> +-------+-------------------------------------+
> | ok | summary |
> +-------+-------------------------------------+
> | true | exec.sort.disable_managed updated. |
> +-------+-------------------------------------+
> 1 row selected (1.096 seconds)
> 0: jdbc:drill:zk=10.10.100.183:5181> alter session set
> `planner.memory.max_query_memory_per_node` = 14106127360;
> +-------+----------------------------------------------------+
> | ok | summary |
> +-------+----------------------------------------------------+
> | true | planner.memory.max_query_memory_per_node updated. |
> +-------+----------------------------------------------------+
> 1 row selected (0.253 seconds)
> 0: jdbc:drill:zk=10.10.100.183:5181> alter session set
> `planner.width.max_per_node` = 1;
> +-------+--------------------------------------+
> | ok | summary |
> +-------+--------------------------------------+
> | true | planner.width.max_per_node updated. |
> +-------+--------------------------------------+
> 1 row selected (0.184 seconds)
> 0: jdbc:drill:zk=10.10.100.183:5181> select * from (select * from
> dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d
> where d.columns[0] = 'ljdfhwuehnoiueyf';
> {code}
> Once the OOM happens, chaos follows:
> {code}
> 1. Dangling fragments are left behind
> 2. Query fails but zookeeper thinks it's still running
> 3. Client connections time out
> 4. Profile page shows the same query as both running and failed.
> {code}
> We should be handling this situation more gracefully as this could be
> perceived as a drillbit stability issue. I attached the jstack. The logs and
> data set used are too big to upload here. Reach out to me if you need more
> information.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)