Victoria Markman created DRILL-4272:
---------------------------------------
Summary: When sort runs out of memory and query fails, resources
are seemingly not freed
Key: DRILL-4272
URL: https://issues.apache.org/jira/browse/DRILL-4272
Project: Apache Drill
Issue Type: Bug
Components: Execution - Relational Operators
Affects Versions: 1.5.0
Reporter: Victoria Markman
Priority: Critical
Executed query11.sql from resources/Advanced/tpcds/tpcds_sf1/original/parquet
Query runs out of memory:
{code}
Error: RESOURCE ERROR: One or more nodes ran out of memory while executing the
query.
Unable to allocate sv2 for 32768 records, and not enough batchGroups to spill.
batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 19961472
allocator limit 20000000
Fragment 19:0
[Error Id: 87aa32b8-17eb-488e-90cb-5f5bffff9aec on atsqa4-133.qa.lab:31010]
(state=,code=0)
{code}
And leaves fragments running, holding resources:
{code}
2016-01-14 22:46:32,435 [Drillbit-ShutdownHook#0] INFO
o.apache.drill.exec.server.Drillbit - Received shutdown request.
2016-01-14 22:46:32,546 [Curator-ServiceCache-0] WARN
o.a.d.e.w.fragment.FragmentExecutor - Foreman atsqa4-136.qa.lab no longer
active. Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:19:0.
2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO
o.a.d.e.w.fragment.FragmentExecutor -
2967db08-cd38-925a-4960-9e881f537af8:19:0: State change requested
CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN
o.a.d.e.w.fragment.FragmentExecutor -
2967db08-cd38-925a-4960-9e881f537af8:19:0: Ignoring unexpected state transition
CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN
o.a.d.e.w.fragment.FragmentExecutor - Foreman atsqa4-136.qa.lab no longer
active. Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:17:0.
2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO
o.a.d.e.w.fragment.FragmentExecutor -
2967db08-cd38-925a-4960-9e881f537af8:17:0: State change requested
CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN
o.a.d.e.w.fragment.FragmentExecutor -
2967db08-cd38-925a-4960-9e881f537af8:17:0: Ignoring unexpected state transition
CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
2016-01-14 22:46:33,563 [BitServer-1] INFO
o.a.d.exec.rpc.control.ControlClient - Channel closed /10.10.88.134:59069 <-->
atsqa4-136.qa.lab/10.10.88.136:31011.
2016-01-14 22:46:33,563 [BitClient-1] INFO o.a.drill.exec.rpc.data.DataClient
- Channel closed /10.10.88.134:34802 <--> atsqa4-136.qa.lab/10.10.88.136:31012.
2016-01-14 22:46:33,590 [BitClient-1] INFO o.a.drill.exec.rpc.data.DataClient
- Channel closed /10.10.88.134:36937 <--> atsqa4-135.qa.lab/10.10.88.135:31012.
2016-01-14 22:46:33,595 [BitClient-1] INFO o.a.drill.exec.rpc.data.DataClient
- Channel closed /10.10.88.134:53860 <--> atsqa4-133.qa.lab/10.10.88.133:31012.
2016-01-14 22:46:38,467 [BitClient-1] INFO o.a.drill.exec.rpc.data.DataClient
- Channel closed /10.10.88.134:48276 <--> atsqa4-134.qa.lab/10.10.88.134:31012.
2016-01-14 22:46:39,470 [pool-6-thread-1] INFO
o.a.drill.exec.rpc.user.UserServer - closed eventLoopGroup
io.netty.channel.nio.NioEventLoopGroup@6fb32dfb in 1003 ms
2016-01-14 22:46:39,470 [pool-6-thread-2] INFO
o.a.drill.exec.rpc.data.DataServer - closed eventLoopGroup
io.netty.channel.nio.NioEventLoopGroup@5c93dd80 in 1003 ms
2016-01-14 22:46:39,470 [pool-6-thread-1] INFO
o.a.drill.exec.service.ServiceEngine - closed userServer in 1004 ms
2016-01-14 22:46:39,470 [pool-6-thread-2] INFO
o.a.drill.exec.service.ServiceEngine - closed dataPool in 1005 ms
2016-01-14 22:46:39,483 [Drillbit-ShutdownHook#0] WARN
o.apache.drill.exec.work.WorkManager - Closing WorkManager but there are 2
running fragments.
2016-01-14 22:46:41,489 [Drillbit-ShutdownHook#0] ERROR
o.a.d.exec.server.BootStrapContext - Pool did not terminate
2016-01-14 22:46:41,498 [Drillbit-ShutdownHook#0] WARN
o.apache.drill.exec.server.Drillbit - Failure on close()
java.lang.RuntimeException: Exception while closing
at
org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:46)
~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at
org.apache.drill.exec.server.BootStrapContext.close(BootStrapContext.java:127)
~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76)
~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64)
~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:149)
~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at
org.apache.drill.exec.server.Drillbit$ShutdownThread.run(Drillbit.java:243)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
Caused by: java.util.ConcurrentModificationException: null
at
java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex(IdentityHashMap.java:732)
~[na:1.7.0_71]
at java.util.IdentityHashMap$KeyIterator.next(IdentityHashMap.java:822)
~[na:1.7.0_71]
at
org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:683)
~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at
org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679)
~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at
org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679)
~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at
org.apache.drill.exec.memory.BaseAllocator.toString(BaseAllocator.java:483)
~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at
org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:431)
~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
at
org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:44)
~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
... 5 common frames omitted
2016-01-14 22:46:41,499 [Drillbit-ShutdownHook#0] INFO
o.apache.drill.exec.server.Drillbit - Shutdown completed (9062 ms).
{code}
My setup (I'm sure it can be scaled down to one node)
* 4 node cluster
* each box is 32 cores
* 48 GB direct memory
* 10GB memory allocated to sort
Drill settings:
{code}
0: jdbc:drill:schema=dfs> select * from sys.options where status like
'%CHANGED%';
+-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
| name | kind | type | status |
num_val | string_val | bool_val | float_val |
+-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
| planner.enable_decimal_data_type | BOOLEAN | SYSTEM | CHANGED |
null | null | true | null |
| planner.enable_hashjoin | BOOLEAN | SYSTEM | CHANGED |
null | null | false | null |
| planner.memory.max_query_memory_per_node | LONG | SYSTEM | CHANGED |
10737418240 | null | null | null |
+-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
3 rows selected (3.532 seconds)
{code}
Or maybe we don't wait for anything to complete and just shut everything down ?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)