[jira] [Updated] (AMBARI-17106) Deadlock While Updating Stale Configuration Cache During Upgrade

Jonathan Hurley (JIRA) Wed, 08 Jun 2016 08:04:40 -0700

     [ 
https://issues.apache.org/jira/browse/AMBARI-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jonathan Hurley updated AMBARI-17106:
-------------------------------------
    Status: Patch Available  (was: Open)

> Deadlock While Updating Stale Configuration Cache During Upgrade
> ----------------------------------------------------------------
>
>                 Key: AMBARI-17106
>                 URL: https://issues.apache.org/jira/browse/AMBARI-17106
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Blocker
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-17106.patch
>
>
> ambari-server --hash
> dc340e8c6cb4fa6c062f805cc1917f62299a5f50
> ambari-server-2.4.0.0-622.x86_64
> *Steps*
> # Deploy HDP-2.4.0.0 cluster with Ambari 2.4.0.0 (unsecure, non-HA cluster, 
> SSL enabled)
> # Start an Express Upgrade (EU) to HDP-2.5.0.0-609
>  
> *Result*
> While the EU was in progress, the Ambari server appeared to hang: the login 
> page loads, but logging in never completes. The following API call hangs too:
> https://server:8443/api/v1/clusters/cl1/
> There is a deadlock when trying to update the stale configuration cache:
> {code}
> "Server Action Executor Worker 401" #225 prio=5 os_prio=0 tid=0x00007fa07c03e800 nid=0x65df waiting on condition [0x00007fa0737ef000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000000a059d4f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>       at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
>   --> @TransactionalLock(lockArea = LockArea.STALE_CONFIG_CACHE, lockType = LockType.WRITE)
>       at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.lockTransaction(AmbariJpaLocalTxnInterceptor.java:291)
>       at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:114)
>       at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
>       at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
>       at org.apache.ambari.server.state.cluster.ClusterImpl$$EnhancerByGuice$$991b84fc.applyConfigs(<generated>)
>   --> clusterGlobalLock.writeLock().lock();
>       at org.apache.ambari.server.state.cluster.ClusterImpl.addDesiredConfig(ClusterImpl.java:2340)
>       at org.apache.ambari.server.state.ConfigHelper.createConfigTypes(ConfigHelper.java:897)
>       at org.apache.ambari.server.controller.internal.UpgradeResourceProvider.applyStackAndProcessConfigurations(UpgradeResourceProvider.java:1174)
> {code}
> {code}
> "ambari-hearbeat-monitor" #23 prio=5 os_prio=0 tid=0x00007fa07476c000 nid=0x20ad waiting on condition [0x00007fa07bbfb000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000000a32d44a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>       at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>       at org.apache.ambari.server.state.cluster.ClusterImpl.getDesiredStackVersion(ClusterImpl.java:1052)
>       at org.apache.ambari.server.state.ConfigHelper.calculateIsStaleConfigs(ConfigHelper.java:1075)
>   --> @TransactionalLock(lockArea = LockArea.STALE_CONFIG_CACHE, lockType = LockType.READ)
>       at org.apache.ambari.server.state.ConfigHelper.isStaleConfigs(ConfigHelper.java:456)
>       at org.apache.ambari.server.agent.HeartbeatMonitor.createStatusCommand(HeartbeatMonitor.java:311)
> {code}
> This is another case of an Ambari cache competing with a JPA transaction. 
> Consider these steps:
> - A new configuration is created within the context of a Transaction
> - Within that same Transaction, the stale configuration cache is told to 
> invalidate
> - After purging the old data, but before the Transaction is committed, 
> another thread tries to read from the cache. Because the new configuration 
> is not yet visible to it, that thread re-populates the cache with the old 
> data.
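> Sometimes the code works because the Transaction is able to commit before 
> the cache is re-populated by another thread. In theory, we should be locking 
> around reading the cache to ensure that there isn't a transaction writing to 
> it. However, this is what caused the deadlock, since it interferes with our 
> wonderful "cluster global lock of doom": the upgrade thread holds the cluster 
> global write lock while waiting for the cache write lock, and the heartbeat 
> monitor holds the cache read lock while waiting for the cluster global read 
> lock.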
> Instead, it's safer in this case to just invalidate the cache after the 
> Transaction completes. 
> - We perform this invalidation on a separate thread to ensure we don't have 
> issues with the cluster global lock
> - Since the cache isn't needed within the context of the invalidation call, 
> it's OK to purge it asynchronously.
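
The approach described above can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not use any Ambari, Guice, or JPA API: the `Store` and `Txn` classes, the `onCommit` hook, and the `host1` key are all hypothetical stand-ins, and the "other thread" reading the cache is simulated by an ordinary call at the point where the race would occur. The first method shows how invalidating inside the transaction can leave old data in the cache; the second defers the purge until after commit and runs it on a separate thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DeferredCacheInvalidation {

    /** Hypothetical stand-in for the database plus the stale-config cache. */
    static final class Store {
        volatile String committed = "v1";   // only committed data is visible to readers
        final Map<String, String> cache = new ConcurrentHashMap<>();

        /** Read-through: a cache miss is re-populated from committed data only. */
        String read(String key) {
            return cache.computeIfAbsent(key, k -> committed);
        }
    }

    /** Hypothetical transaction that runs callbacks once the commit has happened. */
    static final class Txn {
        private final Store store;
        private final List<Runnable> afterCommit = new ArrayList<>();
        private String pending;

        Txn(Store store) { this.store = store; }
        void write(String value) { pending = value; }
        void onCommit(Runnable callback) { afterCommit.add(callback); }
        void commit() {
            store.committed = pending;          // data becomes visible
            afterCommit.forEach(Runnable::run); // then the callbacks fire
        }
    }

    /** The problem: purge inside the transaction, then a reader gets in before commit. */
    public static String invalidateInsideTxn() {
        Store store = new Store();
        store.read("host1");        // cache now holds v1
        Txn txn = new Txn(store);
        txn.write("v2");
        store.cache.clear();        // purge before the commit
        store.read("host1");        // simulated concurrent reader re-populates v1
        txn.commit();
        return store.read("host1"); // cache still answers with the old value
    }

    /** The deferred variant: purge only after commit, on a separate thread. */
    public static String invalidateAfterCommit() {
        Store store = new Store();
        ExecutorService invalidator = Executors.newSingleThreadExecutor();
        store.read("host1");        // cache now holds v1
        Txn txn = new Txn(store);
        txn.write("v2");
        txn.onCommit(() -> invalidator.execute(store.cache::clear));
        store.read("host1");        // a reader during the transaction is harmless
        txn.commit();
        invalidator.shutdown();
        try {
            // Wait for the asynchronous purge so the result below is deterministic.
            invalidator.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return store.read("host1"); // re-populated from the committed value
    }

    public static void main(String[] args) {
        System.out.println("invalidate inside txn:   " + invalidateInsideTxn());
        System.out.println("invalidate after commit: " + invalidateAfterCommit());
    }
}
```

In this sketch the purge thread takes no lock that the committing thread holds, which mirrors the reasoning in the description: the caller does not need the cache during the invalidation call, so it can return without waiting on anything guarded by the cluster global lock.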



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
