[ 
https://issues.apache.org/jira/browse/DRILL-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul Challapalli updated DRILL-5289:
-------------------------------------
    Summary: Drill should manage the heap memory so that we wouldn't hit an OOM 
due to insufficient heap  (was: Drill should handle OOM due to insufficient 
heap type of errors more gracefully)

> Drill should manage the heap memory so that we wouldn't hit an OOM due to 
> insufficient heap
> -------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5289
>                 URL: https://issues.apache.org/jira/browse/DRILL-5289
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow, Execution - RPC
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>         Attachments: jstack.txt, partial_log.txt, Screen Shot 2017-02-22 at 
> 10.58.39 AM (2).png
>
>
> [Git Commit ID will be updated soon]
> The query below, which uses the managed sort, fails with an OOM error due to 
> insufficient heap. That failure is a bug in itself. 
> {code}
> ALTER SESSION SET `exec.sort.disable_managed` = false;
> +-------+-------------------------------------+
> |  ok   |               summary               |
> +-------+-------------------------------------+
> | true  | exec.sort.disable_managed updated.  |
> +-------+-------------------------------------+
> 1 row selected (1.096 seconds)
> 0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.memory.max_query_memory_per_node` = 14106127360;
> +-------+----------------------------------------------------+
> |  ok   |                      summary                       |
> +-------+----------------------------------------------------+
> | true  | planner.memory.max_query_memory_per_node updated.  |
> +-------+----------------------------------------------------+
> 1 row selected (0.253 seconds)
> 0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.width.max_per_node` = 1;
> +-------+--------------------------------------+
> |  ok   |               summary                |
> +-------+--------------------------------------+
> | true  | planner.width.max_per_node updated.  |
> +-------+--------------------------------------+
> 1 row selected (0.184 seconds)
> 0: jdbc:drill:zk=10.10.100.183:5181> select * from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0]) d where d.columns[0] = 'ljdfhwuehnoiueyf';
> {code}
> Once the OOM happens, chaos follows:
> {code}
> 1. Dangling fragments are left behind
> 2. The query fails, but ZooKeeper thinks it is still running
> 3. Client connections time out
> 4. The profile page shows the same query as both running and failed
> {code}
> We should handle this situation more gracefully, because it could be 
> perceived as a drillbit stability issue. I attached the jstack output. The 
> logs and data set used are too big to upload here. Reach out to me if you 
> need more information.
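>
> A minimal sketch for checking heap headroom and the session settings before 
> rerunning the query. It assumes the sys.memory and sys.options system tables 
> are available in this build; neither was part of the original repro.
> {code}
> -- Per-drillbit heap and direct memory usage.
> SELECT * FROM sys.memory;
> -- Confirm the three options changed for this repro.
> SELECT * FROM sys.options
> WHERE name IN ('exec.sort.disable_managed',
>                'planner.memory.max_query_memory_per_node',
>                'planner.width.max_per_node');
> {code}
> Comparing the heap figures before and after the failing query may help show 
> how close the drillbit was to its heap limit.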



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
