[ 
https://issues.apache.org/jira/browse/DRILL-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946183#comment-15946183
 ] 

Paul Rogers commented on DRILL-4272:
------------------------------------

Bug was filed in Jan. of 2016 against 1.5. Did you try that version? Else, we 
can assume that the bug has since been fixed.

The important point is that the problem does not occur in the current version.

We can change the resolution to "not a bug" or "can'r reproduce" if that is 
more descriptive...

> When sort runs out of memory and query fails, resources are seemingly not 
> freed
> -------------------------------------------------------------------------------
>
>                 Key: DRILL-4272
>                 URL: https://issues.apache.org/jira/browse/DRILL-4272
>             Project: Apache Drill
>          Issue Type: Sub-task
>          Components: Execution - Relational Operators
>    Affects Versions: 1.5.0
>            Reporter: Victoria Markman
>            Assignee: Paul Rogers
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Executed query11.sql from resources/Advanced/tpcds/tpcds_sf1/original/parquet
> Query runs out of memory:
> {code}
> Error: RESOURCE ERROR: One or more nodes ran out of memory while executing 
> the query.
> Unable to allocate sv2 for 32768 records, and not enough batchGroups to spill.
> batchGroups.size 1
> spilledBatchGroups.size 0
> allocated memory 19961472
> allocator limit 20000000
> Fragment 19:0
> [Error Id: 87aa32b8-17eb-488e-90cb-5f5bffff9aec on atsqa4-133.qa.lab:31010] 
> (state=,code=0)
> {code}
> And leaves fragments running, holding resources:
> {code}
> 2016-01-14 22:46:32,435 [Drillbit-ShutdownHook#0] INFO  
> o.apache.drill.exec.server.Drillbit - Received shutdown request.
> 2016-01-14 22:46:32,546 [Curator-ServiceCache-0] WARN  
> o.a.d.e.w.fragment.FragmentExecutor - Foreman atsqa4-136.qa.lab no longer 
> active.  Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:19:0.
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 2967db08-cd38-925a-4960-9e881f537af8:19:0: State change requested 
> CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 2967db08-cd38-925a-4960-9e881f537af8:19:0: Ignoring unexpected state 
> transition CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN  
> o.a.d.e.w.fragment.FragmentExecutor - Foreman atsqa4-136.qa.lab no longer 
> active.  Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:17:0.
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 2967db08-cd38-925a-4960-9e881f537af8:17:0: State change requested 
> CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN  
> o.a.d.e.w.fragment.FragmentExecutor - 
> 2967db08-cd38-925a-4960-9e881f537af8:17:0: Ignoring unexpected state 
> transition CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
> 2016-01-14 22:46:33,563 [BitServer-1] INFO  
> o.a.d.exec.rpc.control.ControlClient - Channel closed /10.10.88.134:59069 
> <--> atsqa4-136.qa.lab/10.10.88.136:31011.
> 2016-01-14 22:46:33,563 [BitClient-1] INFO  
> o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:34802 <--> 
> atsqa4-136.qa.lab/10.10.88.136:31012.
> 2016-01-14 22:46:33,590 [BitClient-1] INFO  
> o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:36937 <--> 
> atsqa4-135.qa.lab/10.10.88.135:31012.
> 2016-01-14 22:46:33,595 [BitClient-1] INFO  
> o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:53860 <--> 
> atsqa4-133.qa.lab/10.10.88.133:31012.
> 2016-01-14 22:46:38,467 [BitClient-1] INFO  
> o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:48276 <--> 
> atsqa4-134.qa.lab/10.10.88.134:31012.
> 2016-01-14 22:46:39,470 [pool-6-thread-1] INFO  
> o.a.drill.exec.rpc.user.UserServer - closed eventLoopGroup 
> io.netty.channel.nio.NioEventLoopGroup@6fb32dfb in 1003 ms
> 2016-01-14 22:46:39,470 [pool-6-thread-2] INFO  
> o.a.drill.exec.rpc.data.DataServer - closed eventLoopGroup 
> io.netty.channel.nio.NioEventLoopGroup@5c93dd80 in 1003 ms
> 2016-01-14 22:46:39,470 [pool-6-thread-1] INFO  
> o.a.drill.exec.service.ServiceEngine - closed userServer in 1004 ms
> 2016-01-14 22:46:39,470 [pool-6-thread-2] INFO  
> o.a.drill.exec.service.ServiceEngine - closed dataPool in 1005 ms
> 2016-01-14 22:46:39,483 [Drillbit-ShutdownHook#0] WARN  
> o.apache.drill.exec.work.WorkManager - Closing WorkManager but there are 2 
> running fragments.
> 2016-01-14 22:46:41,489 [Drillbit-ShutdownHook#0] ERROR 
> o.a.d.exec.server.BootStrapContext - Pool did not terminate
> 2016-01-14 22:46:41,498 [Drillbit-ShutdownHook#0] WARN  
> o.apache.drill.exec.server.Drillbit - Failure on close()
> java.lang.RuntimeException: Exception while closing
>         at 
> org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:46)
>  ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.server.BootStrapContext.close(BootStrapContext.java:127)
>  ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) 
> ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) 
> ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:149) 
> ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.server.Drillbit$ShutdownThread.run(Drillbit.java:243) 
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> Caused by: java.util.ConcurrentModificationException: null
>         at 
> java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex(IdentityHashMap.java:732)
>  ~[na:1.7.0_71]
>         at 
> java.util.IdentityHashMap$KeyIterator.next(IdentityHashMap.java:822) 
> ~[na:1.7.0_71]
>         at 
> org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:683) 
> ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679) 
> ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679) 
> ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.memory.BaseAllocator.toString(BaseAllocator.java:483) 
> ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:431) 
> ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:44)
>  ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         ... 5 common frames omitted
> 2016-01-14 22:46:41,499 [Drillbit-ShutdownHook#0] INFO  
> o.apache.drill.exec.server.Drillbit - Shutdown completed (9062 ms).
> {code}
> or this stack trace:
> {code}
> 2016-01-15 00:59:09,699 [Drillbit-ShutdownHook#0] WARN  
> o.apache.drill.exec.work.WorkManager - Closing WorkManager but there are 1 
> running fragments.
> 2016-01-15 00:59:11,704 [Drillbit-ShutdownHook#0] ERROR 
> o.a.d.exec.server.BootStrapContext - Pool did not terminate
> 2016-01-15 00:59:11,708 [Drillbit-ShutdownHook#0] WARN  
> o.apache.drill.exec.server.Drillbit - Failure on close()
> java.lang.RuntimeException: Exception while closing
>         at 
> org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:46)
>  ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.server.BootStrapContext.close(BootStrapContext.java:127)
>  ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) 
> ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) 
> ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:149) 
> ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.server.Drillbit$ShutdownThread.run(Drillbit.java:243) 
> [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> Caused by: java.lang.IllegalStateException: allocator[frag:19:0]: buffer 
> space (0) + prealloc space (0) + child space (111759688) != allocated 
> (111104328)
>         at 
> org.apache.drill.exec.memory.BaseAllocator.verifyAllocator(BaseAllocator.java:653)
>  ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.memory.BaseAllocator.verifyAllocator(BaseAllocator.java:557)
>  ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.memory.BaseAllocator.verifyAllocator(BaseAllocator.java:528)
>  ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:419) 
> ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at 
> org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:44)
>  ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         ... 5 common frames omitted
> 2016-01-15 00:59:11,708 [Drillbit-ShutdownHook#0] INFO  
> o.apache.drill.exec.server.Drillbit - Shutdown completed (9058 ms).
> {code}
> Fragment 19:0 is sort that failed with OOM:
> {code}
> 2016-01-15 00:58:00,552 [2967bc04-43e6-6e2c-e91f-4401daae16cc:frag:19:0] INFO 
>  o.a.d.e.p.i.xsort.ExternalSortBatch - User Error Occurred
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more 
> nodes ran out of memory while executing the query.
> {code}
> My setup (I'm sure it can be scaled down to one node)
> * 4 node cluster
> * each box is 32 cores
> * 48 GB direct memory
> * 10GB memory allocated to sort
> Drill settings:
> {code}
> 0: jdbc:drill:schema=dfs> select * from sys.options where status like 
> '%CHANGED%';
> +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
> |                   name                    |   kind   |  type   |  status  | 
>   num_val    | string_val  | bool_val  | float_val  |
> +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
> | planner.enable_decimal_data_type          | BOOLEAN  | SYSTEM  | CHANGED  | 
> null         | null        | true      | null       |
> | planner.enable_hashjoin                   | BOOLEAN  | SYSTEM  | CHANGED  | 
> null         | null        | false     | null       |
> | planner.memory.max_query_memory_per_node  | LONG     | SYSTEM  | CHANGED  | 
> 10737418240  | null        | null      | null       |
> +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
> 3 rows selected (3.532 seconds)
> {code}
> Or maybe we don't wait for anything to complete and just shut everything down 
> ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to