[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386297#comment-14386297 ]

Cyrille Chépélov commented on TEZ-2237:
---------------------------------------

Good catch, [~rajesh.balamohan], I hadn't seen this before. Something definitely 
happened around 23:53. On !gc_count_MRAppMaster.png!, one can see a quite 
GC-happy AppMaster (probably the DAGAppMaster) stopping. Simultaneously, a lot 
of traffic starts happening on the DFS at the container level, as evidenced by 
!start_containers.png!, !stop_containers.png!, !alloc_mem.png! and 
!alloc_vcores.png!.

!yarn_rm_flips.png! shows that the RM torch indeed passes from rm2 to rm5, 
with the VIP following shortly afterwards _(for context: three machines, 
orc2, orc3 and orc5 are designated to carry the RM role, named rm2, rm3 and rm5 
respectively in the YARN HA setup. A Keepalived-based virtual IP follows 
whichever RM is ACTIVE, so that YARN clients contact the correct host first, 
before the YARN HA client protocol has to kick in)_.
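
For reference, the HA wiring described above can be sketched as a yarn-site.xml fragment. This is a hypothetical reconstruction from the description, not the actual config; the hostnames and rm IDs match the setup described here, everything else is assumed:

{noformat}
<!-- Hypothetical sketch of the RM HA setup described above; -->
<!-- values are assumptions, not the cluster's actual config. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm2,rm3,rm5</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>orc2.lan.par.transparencyrights.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm3</name>
  <value>orc3.lan.par.transparencyrights.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm5</name>
  <value>orc5.lan.par.transparencyrights.com</value>
</property>
{noformat}

The Keepalived VIP is an optimization on top of this: clients try the VIP host first and only fall back to the HA client-side failover protocol if needed.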

I was a bit concerned that memory could have been tight, but according to 
!mem_free.png!, things should be fine.

Looking at the RM logs on orc2, indeed RM has died, apparently suddenly, and 
was restarted by systemd:

{noformat}
2015-03-25 23:53:08,987 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1427282048097_0363 on node: orc3.lan.par.transparencyrights.com:8242
2015-03-25 23:53:08,988 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Reserved container application=application_1427282048097_0363 resource=<memory:2304, vCores:1> queue=test: capacity=0.2, absoluteCapacity=0.2, usedResources=<memory:25344, vCores:15>, usedCapacity=1.4171326, absoluteUsedCapacity=0.28342018, numApps=22, numContainers=15 usedCapacity=1.4171326 absoluteUsedCapacity=0.28342018 used=<memory:25344, vCores:15> cluster=<memory:89422, vCores:56>
2015-03-25 23:53:08,988 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node orc3.lan.par.transparencyrights.com:8242 is reserved by application appattempt_1427282048097_0363_000001
2015-03-25 23:53:09,243 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8030: readAndProcess from client 10.0.1.143 threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2015-03-25 23:53:09,334 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8031: readAndProcess from client 10.0.1.143 threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2015-03-25 23:53:10,106 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting ResourceManager
STARTUP_MSG:   host = orc2/10.0.1.64
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.6.0
[snip]
2015-03-25 23:53:11,939 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2015-03-25 23:53:11,939 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8033: starting
2015-03-25 23:53:11,948 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/opt/hadoop-2.6.0/conf-yarn-rm/yarn-site.xml
2015-03-25 23:53:11,950 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn-rm  OPERATION=refreshAdminAcls      TARGET=AdminService     RESULT=SUCCESS
2015-03-25 23:53:11,953 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in standby state
2015-03-25 23:53:11,953 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn-rm  OPERATION=transitionToStandby   TARGET=RMHAProtocolService      RESULT=SUCCESS
2015-03-26 00:03:11,151 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up
{noformat}

(It is still STANDBY as of this writing.) Strangely, a second RM also 
restarted simultaneously, on orc3, while the one on orc5 survived.


I saw no such event on 2015-03-27, when the !all_stacks.lst! snapshot was 
taken (a separate instance of the same behaviour).
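
A quick way to confirm how many times each RM bounced is to count startup banners in the RM logs, since each restart prints a {{STARTUP_MSG}} banner like the one above (the log path would need to be adapted; this is just a sketch):

```shell
# Count RM restarts in a given log file: each restart prints exactly one
# "STARTUP_MSG: Starting ResourceManager" line in the startup banner.
count_rm_restarts() {
  grep -c 'STARTUP_MSG: Starting ResourceManager' "$1"
}
```

Running it against each of orc2, orc3 and orc5's RM logs should show two restarts and one survivor for the 23:53 event.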

The action plan I intend to follow now:
# Start the same binaries again as on Friday 2015-03-27, watching for RM 
flip-overs.
# Try again with the latest branch-0.6 libraries, to rule out TEZ-2214 or 
anything else recently fixed.
# Report back.

> BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG 
> lingers
> -------------------------------------------------------------------------------
>
>                 Key: TEZ-2237
>                 URL: https://issues.apache.org/jira/browse/TEZ-2237
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system 
> disk + 4*1 or 2 TiB HDD for HDFS & local  (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>            Reporter: Cyrille Chépélov
>         Attachments: all_stacks.lst, alloc_mem.png, alloc_vcores.png, 
> appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz, 
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz, 
> gc_count_MRAppMaster.png, mem_free.png, start_containers.png, 
> stop_containers.png, 
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz, 
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz, yarn_rm_flips.png
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG), 
> after about an hour of processing, several BufferTooSmallExceptions are 
> raised in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active", 
> tying up memory and CPU resources as far as YARN is concerned, while little 
> if any actual processing takes place. 
> It seems two separate issues are at hand:
>   1. BufferTooSmallExceptions are raised even though, small as the actually 
> allocated buffers seem to be (around a couple of megabytes were allotted 
> whereas 100 MiB were requested), the actual keys and values are never bigger 
> than 24 and 1024 bytes respectively.
>   2. When BufferTooSmallExceptions are raised, the DAG fails to stop (stop 
> requests appear to be sent 7 hours after the BTSE exceptions are raised, but 
> 9 hours after these stop requests, the DAG was still lingering on with all 
> containers present, tying up memory and CPU allocations).
> The emergence of the BTSEs prevents the Cascade from completing, which in 
> turn prevents validating the results against traditional MR1-based results. 
> The lack of a conclusion renders the cluster queue unavailable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)