Thanks Steve and Prabhu for the information.

The cause turned out to be locking in CapacityScheduler#reinitialize.
I think the method is called after transitioning to the active state if RM-HA is enabled.
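
For readers following the thread, here is a minimal sketch of how a reader can park behind a ReentrantReadWriteLock whose write lock is never fully released. The class, the method names, and the double-acquire pattern are hypothetical illustrations, not code from CapacityScheduler; see YARN-10347 for the actual change. The reader uses a timed tryLock only so that the demo terminates, whereas the handler in the stack trace uses a plain lock() and waits indefinitely.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockLeakSketch {
    // Hypothetical stand-in for a scheduler-wide lock; not the real CapacityScheduler field.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Assumed bug pattern: the write lock is acquired twice (reentrantly)
    // but released only once, so its hold count never drops back to zero.
    void reinitializeWithLeak() {
        lock.writeLock().lock();
        lock.writeLock().lock();
        try {
            // ... reload configuration here ...
        } finally {
            lock.writeLock().unlock(); // one unlock for two lock() calls
        }
    }

    // Reader path, analogous to a request handler that needs the read lock.
    // Returns true if the read lock was obtained within the timeout.
    boolean tryRead(long timeoutMs) throws InterruptedException {
        if (lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            try {
                return true;
            } finally {
                lock.readLock().unlock();
            }
        }
        return false;
    }

    // True if any thread still holds the write lock.
    boolean isWriteLocked() {
        return lock.isWriteLocked();
    }

    public static void main(String[] args) throws InterruptedException {
        WriteLockLeakSketch sketch = new WriteLockLeakSketch();

        // Run the leaky method on its own thread, as a state transition would.
        Thread transition = new Thread(sketch::reinitializeWithLeak);
        transition.start();
        transition.join();

        // The write lock stays held even though the owning thread has finished.
        System.out.println("write lock still held: " + sketch.isWriteLocked());
        System.out.println("reader acquired lock: " + sketch.tryRead(200));
    }
}
```

In a thread dump, a reader stuck like this shows up as WAITING (parking) in ReentrantReadWriteLock$ReadLock.lock, which matches the handler trace quoted below.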

I filed YARN-10347 and created a PR.


Masatake Iwasaki


On 2020/07/08 16:33, Prabhu Joseph wrote:
Hi Masatake,

The thread is waiting for a ReadLock; we need to check what the other
thread holding the WriteLock is blocked on.
Can you get three consecutive complete jstack dumps of the ResourceManager
during the issue?

I got no issue if RM-HA is disabled.
It looks like the RM is not able to access the ZooKeeper state store. Can you
check whether there is any connectivity issue between the RM and ZooKeeper?

Thanks,
Prabhu Joseph


On Mon, Jul 6, 2020 at 2:44 AM Masatake Iwasaki <iwasak...@oss.nttdata.co.jp>
wrote:

Thanks for putting this up, Gabor Bota.

I'm testing RC2 on a 3-node Docker cluster with NN-HA and RM-HA enabled.
The ResourceManager reproducibly blocks on submitApplication while launching
example MR jobs.
Has anyone run into the same issue?

The same configuration worked for 3.1.3.
I got no issue if RM-HA is disabled.


"IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0 tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000085d37a40> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
        at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)


Masatake Iwasaki

On 2020/06/26 22:51, Gabor Bota wrote:
Hi folks,

I have put together a release candidate (RC2) for Hadoop 3.1.4.

The RC is available at:
http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
The RC tag in git is here:
https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
The maven artifacts are staged at
https://repository.apache.org/content/repositories/orgapachehadoop-1269/

You can find my public key at:
https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C

Please try the release and vote. The vote will run for 5 weekdays,
until July 6, 2020, 23:00 CET.

The release includes the revert of HDFS-14941, as it caused
HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
(https://issues.apache.org/jira/browse/HDFS-15421)
The release includes HDFS-15323, as requested.
(https://issues.apache.org/jira/browse/HDFS-15323)

Thanks,
Gabor

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
