[
https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385613#comment-14385613
]
Rajesh Balamohan commented on TEZ-2237:
---------------------------------------
It is not due to TEZ-2214 [~hitesh]. The issue is not related to BTSE either,
since the tasks do complete after lots of spills.
In "appmaster____syslog_dag_1427282048097_0237_1.red.txt" (DAG_0237), I see
that the RM was unreachable at the end and DAGAppMaster's shutdown hook was
invoked.
[~cchepelov] - Were there any RM issues while these DAGs were running?
{noformat}
2015-03-25 23:53:11,730 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
....
2015-03-25 23:53:14,883 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm3
....
2015-03-25 23:53:18,588 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm5
....
2015-03-25 23:53:22,922 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to vip
....
2015-03-25 23:53:34,584 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
....
2015-03-25 23:53:51,705 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm3
...
...
2015-03-25 23:53:53,712 INFO [AMRM Heartbeater thread] retry.RetryInvocationHandler: Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm3 after 6 fail over attempts. Trying to fail over after sleeping for 17691ms.
java.net.ConnectException: Call From orc4.lan.par.transparencyrights.com/10.0.1.65 to orc3:8030 failed on connection exception: java.net.ConnectException: Connexion refusée; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
	at org.apache.hadoop.ipc.Client.call(Client.java:1472)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
	at com.sun.proxy.$Proxy14.allocate(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
	at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy15.allocate(Unknown Source)
	at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
	at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.net.ConnectException: Connexion refusée
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
	at org.apache.hadoop.ipc.Client.call(Client.java:1438)
	... 12 more
2015-03-25 23:53:55,654 INFO [Thread-1] app.DAGAppMaster: DAGAppMasterShutdownHook invoked
2015-03-25 23:53:55,654 INFO [Thread-1] app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
2015-03-25 23:53:55,654 INFO [Thread-1] rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was : true
2015-03-25 23:53:55,656 INFO [Thread-1] history.HistoryEventHandler: Stopping HistoryEventHandler
2015-03-25 23:53:55,656 INFO [Thread-1] recovery.RecoveryService: Stopping RecoveryService
2015-03-25 23:53:55,656 INFO [Thread-1] recovery.RecoveryService: Closing Summary Stream
{noformat}
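For reference, the underlying error in the trace above ("Connexion refusée", i.e. "Connection refused" from a French-locale JVM) is the generic java.net.ConnectException raised when nothing is listening on the target port, as described on the linked Hadoop wiki page. A minimal standalone reproduction (the host/port here are placeholders, not the cluster's; port 1 is simply a port that is almost never listening):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectionRefusedDemo {
    // Attempts a plain TCP connect and returns the failure class name
    // and message, or "connected" on success.
    static String tryConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return "connected";
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Nothing listens on 127.0.0.1:1, so this normally fails fast with
        // java.net.ConnectException -- the same low-level error the AMRM
        // heartbeater hit against orc3:8030 once the RM was down.
        System.out.println(tryConnect("127.0.0.1", 1, 2000));
    }
}
```

This is consistent with the RM process itself being down or unreachable (rather than slow), since a refused connection is returned immediately by the remote OS.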
> BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers
> -------------------------------------------------------------------------------
>
> Key: TEZ-2237
> URL: https://issues.apache.org/jira/browse/TEZ-2237
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system
> disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
> Reporter: Cyrille Chépélov
> Attachments: all_stacks.lst,
> appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz,
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz,
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz,
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG),
> after about an hour of processing, several BufferTooSmallExceptions are
> raised in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active",
> tying up memory and CPU resources as far as YARN is concerned, while little
> if any actual processing takes place.
> It seems two separate issues are at hand:
> 1. BufferTooSmallExceptions are raised even though the actual keys and
> values are never bigger than 24 and 1024 bytes respectively, small as the
> actually allocated buffers seem to be (around a couple of megabytes were
> allotted where 100 MiB were requested).
> 2. When BufferTooSmallExceptions are raised, the DAG fails to stop (stop
> requests appear to be sent 7 hours after the BTSEs are raised, but 9 hours
> after these stop requests the DAG was still lingering on, with all
> containers present, tying up memory and CPU allocations).
> The emergence of the BTSEs prevents the Cascade from completing, which in
> turn prevents validating the results against the traditional MR1-based
> results. The lack of completion renders the cluster queue unavailable.
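The size mismatch described in point 1 can be illustrated with a hypothetical sketch (this is not the actual UnorderedPartitionedKVWriter code): a writer that spills records into a fixed-capacity buffer can only ever raise a "buffer too small" error when a single serialized record exceeds one buffer's capacity, so with the reported record sizes even a ~2 MiB buffer should suffice, which is what makes the BTSEs surprising here:

```java
public class BufferGuardSketch {
    // Hypothetical guard: a record can be spilled only if its serialized
    // key + value fit within one spill buffer's capacity.
    static boolean fits(int keyLen, int valLen, int bufferCapacity) {
        return keyLen + valLen <= bufferCapacity;
    }

    public static void main(String[] args) {
        int requestedBytes = 100 * 1024 * 1024; // 100 MiB requested per the report
        int allottedBytes  = 2 * 1024 * 1024;   // ~2 MiB apparently allotted
        int keyLen = 24, valLen = 1024;         // observed maximum record sizes

        // A 24 + 1024 byte record fits comfortably in either buffer, so the
        // effective per-record capacity at the point of failure must have
        // been far below even the ~2 MiB figure for a BTSE to be raised.
        System.out.println(fits(keyLen, valLen, requestedBytes));
        System.out.println(fits(keyLen, valLen, allottedBytes));
    }
}
```

Under this simplification both checks pass, which suggests the buffers were carved up further (or sized far smaller) than the reported figures at the moment the exceptions were thrown.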
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)