[
https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386297#comment-14386297
]
Cyrille Chépélov commented on TEZ-2237:
---------------------------------------
Good catch, [~rajesh.balamohan], I hadn't seen this before. Something definitely
happened around 23:53. In !gc_count_MRAppMaster.png!, one can see a rather
GC-happy appmaster (probably the DAGAppMaster) stopping. Simultaneously, a lot
of traffic starts happening on the DFS at the container level, as evidenced by
!start_containers.png!, !stop_containers.png!, !alloc_mem.png! and
!alloc_vcores.png!.
!yarn_rm_flips.png! shows that the RM torch does indeed pass from rm2 to rm5,
with the VIP following shortly afterwards _(for context: three machines, orc2,
orc3 and orc5, are designated to carry the RM mission, named rm2, rm3 and rm5
respectively in the YARN HA setup. A Keepalived-based virtual IP follows
whichever RM is ACTIVE, so that YARN clients contact the correct host first,
before the YARN HA client failover protocol has to kick in)_.
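For reference, a three-RM HA setup like the one described is typically declared along these lines in yarn-site.xml (a sketch only; the rm ids match the naming above, the hostnames are taken from the logs, and the cluster-id is hypothetical — the actual configuration may differ):

```xml
<!-- Sketch of a YARN ResourceManager HA declaration for rm2, rm3 and rm5.
     The cluster-id value is hypothetical. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>orc-cluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm2,rm3,rm5</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>orc2.lan.par.transparencyrights.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm3</name>
    <value>orc3.lan.par.transparencyrights.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm5</name>
    <value>orc5.lan.par.transparencyrights.com</value>
  </property>
</configuration>
```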
I was a bit concerned that memory could have been tight, but according to
!mem_free.png!, things should be fine.
Looking at the RM logs on orc2, the RM has indeed died, apparently suddenly,
and was restarted by systemd:
{noformat}
2015-03-25 23:53:08,987 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1427282048097_0363 on node: orc3.lan.par.transparencyrights.com:8242
2015-03-25 23:53:08,988 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Reserved container application=application_1427282048097_0363 resource=<memory:2304, vCores:1> queue=test: capacity=0.2, absoluteCapacity=0.2, usedResources=<memory:25344, vCores:15>, usedCapacity=1.4171326, absoluteUsedCapacity=0.28342018, numApps=22, numContainers=15 usedCapacity=1.4171326 absoluteUsedCapacity=0.28342018 used=<memory:25344, vCores:15> cluster=<memory:89422, vCores:56>
2015-03-25 23:53:08,988 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node orc3.lan.par.transparencyrights.com:8242 is reserved by application appattempt_1427282048097_0363_000001
2015-03-25 23:53:09,243 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8030: readAndProcess from client 10.0.1.143 threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2015-03-25 23:53:09,334 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8031: readAndProcess from client 10.0.1.143 threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2015-03-25 23:53:10,106 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting ResourceManager
STARTUP_MSG:   host = orc2/10.0.1.64
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.6.0
[snip]
2015-03-25 23:53:11,939 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2015-03-25 23:53:11,939 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8033: starting
2015-03-25 23:53:11,948 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/opt/hadoop-2.6.0/conf-yarn-rm/yarn-site.xml
2015-03-25 23:53:11,950 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn-rm OPERATION=refreshAdminAcls TARGET=AdminService RESULT=SUCCESS
2015-03-25 23:53:11,953 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in standby state
2015-03-25 23:53:11,953 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn-rm OPERATION=transitionToStandby TARGET=RMHAProtocolService RESULT=SUCCESS
2015-03-26 00:03:11,151 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up
{noformat}
(It is still STANDBY as of this writing.) Strangely, a second RM, on orc3, also
restarted simultaneously, while orc5 survived.
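For completeness, the automatic restart comes from systemd supervision of the RM; a minimal sketch of such a unit could look like the following (the user and config directory are inferred from the log above, everything else is hypothetical, and the actual unit file on the cluster may well differ):

```ini
# Sketch of a systemd unit that restarts the RM on sudden death.
# Paths and user are illustrative, taken or guessed from the RM log.
[Unit]
Description=YARN ResourceManager
After=network.target

[Service]
User=yarn-rm
ExecStart=/opt/hadoop-2.6.0/bin/yarn --config /opt/hadoop-2.6.0/conf-yarn-rm resourcemanager
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target
```

(The current HA state of each RM can be checked with {{yarn rmadmin -getServiceState rm2}}, etc.)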
I have seen no such event on 2015-03-27, when the !all_stacks.lst! snapshot was
taken (a separate instance of the same behaviour).
The action plan I intend to follow now:
# Restart the same binaries as on Friday 2015-03-27, watching for RM
flip-overs.
# Retry with the latest branch-0.6 libraries, to rule out TEZ-2214 or anything
else recently fixed.
# Report back.
> BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG
> lingers
> -------------------------------------------------------------------------------
>
> Key: TEZ-2237
> URL: https://issues.apache.org/jira/browse/TEZ-2237
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system
> disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
> Reporter: Cyrille Chépélov
> Attachments: all_stacks.lst, alloc_mem.png, alloc_vcores.png,
> appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz,
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz,
> gc_count_MRAppMaster.png, mem_free.png, start_containers.png,
> stop_containers.png,
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz,
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz, yarn_rm_flips.png
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG),
> after about an hour of processing, several BufferTooSmallExceptions are
> raised in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active",
> tying up memory and CPU resources as far as YARN is concerned, while little
> if any actual processing takes place.
> It seems two separate issues are at hand:
> 1. BufferTooSmallExceptions are raised even though the actual keys and values
> are never bigger than 24 and 1024 bytes respectively, small as the actually
> allocated buffers seem to be (around a couple of megabytes were allotted,
> whereas 100 MiB were requested).
> 2. When BufferTooSmallExceptions are raised, the DAG fails to stop (stop
> requests appear to be sent 7 hours after the BTSEs are raised, yet 9 hours
> after these stop requests the DAG was still lingering on, with all containers
> present, tying up memory and CPU allocations).
> The emergence of the BTSEs prevents the Cascade from completing, which in
> turn prevents validating the results against the traditional MR1-based
> results. The lack of a conclusive run renders the cluster queue unavailable.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)