[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385613#comment-14385613 ]

Rajesh Balamohan commented on TEZ-2237:
---------------------------------------

It is not due to TEZ-2214 [~hitesh]. The issue is not related to BTSE either, as 
the tasks are completing after lots of spills.

In "appmaster____syslog_dag_1427282048097_0237_1.red.txt" (DAG_0237), I see 
that the RM was not reachable at the end and the DAGAppMaster's shutdown hook 
was invoked.

[~cchepelov] - Were there any RM issues when these DAGs were running?

{noformat}
2015-03-25 23:53:11,730 INFO [AMRM Heartbeater thread] 
client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
....
2015-03-25 23:53:14,883 INFO [AMRM Heartbeater thread] 
client.ConfiguredRMFailoverProxyProvider: Failing over to rm3
....
2015-03-25 23:53:18,588 INFO [AMRM Heartbeater thread] 
client.ConfiguredRMFailoverProxyProvider: Failing over to rm5
....
2015-03-25 23:53:22,922 INFO [AMRM Heartbeater thread] 
client.ConfiguredRMFailoverProxyProvider: Failing over to vip
....
2015-03-25 23:53:34,584 INFO [AMRM Heartbeater thread] 
client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
....
2015-03-25 23:53:51,705 INFO [AMRM Heartbeater thread] 
client.ConfiguredRMFailoverProxyProvider: Failing over to rm3
...
...
2015-03-25 23:53:53,712 INFO [AMRM Heartbeater thread] 
retry.RetryInvocationHandler: Exception while invoking allocate of class 
ApplicationMasterProtocolPBClientImpl over rm3 after 6 fail over attempts. 
Trying to fail over after sleeping for 17691ms.
java.net.ConnectException: Call From 
orc4.lan.par.transparencyrights.com/10.0.1.65 to orc3:8030 failed on connection 
exception: java.net.ConnectException: Connexion refusée; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
        at org.apache.hadoop.ipc.Client.call(Client.java:1472)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy14.allocate(Unknown Source)
        at 
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy15.allocate(Unknown Source)
        at 
org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
        at 
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.net.ConnectException: Connexion refusée
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
        at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
        at org.apache.hadoop.ipc.Client.call(Client.java:1438)
        ... 12 more
2015-03-25 23:53:55,654 INFO [Thread-1] app.DAGAppMaster: 
DAGAppMasterShutdownHook invoked
2015-03-25 23:53:55,654 INFO [Thread-1] app.DAGAppMaster: DAGAppMaster received 
a signal. Signaling TaskScheduler
2015-03-25 23:53:55,654 INFO [Thread-1] rm.TaskSchedulerEventHandler: 
TaskScheduler notified that iSignalled was : true
2015-03-25 23:53:55,656 INFO [Thread-1] history.HistoryEventHandler: Stopping 
HistoryEventHandler
2015-03-25 23:53:55,656 INFO [Thread-1] recovery.RecoveryService: Stopping 
RecoveryService
2015-03-25 23:53:55,656 INFO [Thread-1] recovery.RecoveryService: Closing 
Summary Stream
{noformat}
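For illustration only: the pattern in the log above (cycling through rm2, rm3, rm5, a VIP, then sleeping and retrying before allocate() finally fails) is the usual RM fail-over retry loop. A minimal, hypothetical sketch of that behavior follows; the names, structure, and always-failing connection check are illustrative assumptions, not Hadoop's actual RetryInvocationHandler/ConfiguredRMFailoverProxyProvider code.

```java
import java.util.List;

// Hypothetical sketch of a round-robin RM fail-over retry loop like the one
// the AMRM heartbeater logs above; illustrative only, not Hadoop's code.
final class FailoverSketch {
    static String tryAllocate(List<String> rms, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Corresponds to the repeated "Failing over to rmN" log lines.
            String rm = rms.get(attempt % rms.size());
            if (isReachable(rm)) {
                return rm; // allocate() would succeed against this RM
            }
            // The real client sleeps with backoff here ("Trying to fail over
            // after sleeping for ...ms") before moving to the next RM.
        }
        // All attempts exhausted: allocate() throws, and the AM can end up
        // in its shutdown path as seen at the end of the log.
        return null;
    }

    // Stand-in for a connection attempt; always fails here, matching the
    // "Connexion refusée" (connection refused) seen in the log.
    static boolean isReachable(String rm) {
        return false;
    }
}
```

In this sketch, when every RM in the list refuses the connection for all attempts, the caller gets no RM back, which mirrors the AM giving up after 6 fail-over attempts.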

> BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG 
> lingers
> -------------------------------------------------------------------------------
>
>                 Key: TEZ-2237
>                 URL: https://issues.apache.org/jira/browse/TEZ-2237
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system 
> disk + 4*1 or 2 TiB HDD for HDFS & local  (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>            Reporter: Cyrille Chépélov
>         Attachments: all_stacks.lst, 
> appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz, 
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz, 
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz, 
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG), 
> after about an hour of processing, several BufferTooSmallExceptions are 
> raised in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active", 
> tying up memory and CPU resources as far as YARN is concerned, while little 
> if any actual processing takes place. 
> It seems two separate issues are at hand:
>   1. BufferTooSmallExceptions are raised even though the actual keys and 
> values are never bigger than 24 and 1024 bytes respectively. Small as the 
> actually allocated buffers seem to be (around a couple of megabytes were 
> allotted whereas 100 MiB were requested), records of that size should still 
> fit.
>   2. When BufferTooSmallExceptions are raised, the DAG fails to stop (stop 
> requests appear to be sent 7 hours after the BTSEs are raised, but 9 hours 
> after those stop requests the DAG was still lingering, with all its 
> containers present and tying up memory and CPU allocations).
> The emergence of the BTSEs prevents the Cascade from completing, which in 
> turn prevents validating the results against traditional MR1-based results. 
> Because the DAG never concludes, the cluster queue remains unavailable.
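A note on issue 1 in the quoted description: a writer of this kind would only be expected to reject a record that cannot fit in its spill buffer at all. The arithmetic can be made concrete with a minimal sketch; the class and method names here are hypothetical, not Tez's actual UnorderedPartitionedKVWriter implementation, and the framing overhead real writers add per record is deliberately ignored.

```java
// Hypothetical sketch of a record-vs-buffer size check; names are
// illustrative, not Tez's actual UnorderedPartitionedKVWriter code.
final class SpillBufferCheck {
    static boolean fitsInBuffer(int keyLen, int valueLen, int bufferBytes) {
        // A record only fits if key + value (ignoring any per-record framing
        // overhead a real writer would add) is no larger than the buffer.
        return keyLen + valueLen <= bufferBytes;
    }

    public static void main(String[] args) {
        int buffer = 2 * 1024 * 1024; // ~2 MiB, roughly what was allotted
        // The reported maximum record sizes: 24-byte keys, 1024-byte values.
        System.out.println(fitsInBuffer(24, 1024, buffer));
    }
}
```

Since 24 + 1024 bytes fits comfortably in even a 2 MiB buffer, the reported BTSEs are surprising on record size alone, which is consistent with the reporter's point that something else must be driving the exceptions.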



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
