[ https://issues.apache.org/jira/browse/AMBARI-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ximo Guanter updated AMBARI-4930:
---------------------------------

    Attachment: AMBARI-4930-1.patch

It seems that cluster initialization cannot run at the same time as the heartbeat
monitoring process. Simply moving the line that starts the heartbeat monitor to
after cluster initialization appears to solve the problem (see AMBARI-4930-1.patch);
the ordering is sketched below.
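
In essence, the workaround only changes startup ordering. The following
stand-alone sketch (not actual Ambari code; class and method names are made up
for illustration) shows the idea: the monitor thread is started only once
initialization has finished, so the two can no longer race on shared state.

{code}
// Stand-alone illustration of the workaround's ordering (hypothetical, not Ambari code).
public class StartupOrderDemo {

    // Stands in for Ambari's HeartbeatMonitor thread ("Thread-2" in the log).
    private final Thread heartbeatMonitor = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            System.out.println("checking heartbeats...");
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }, "HeartbeatMonitor");

    private void initializeClusters() {
        // In Ambari this is roughly where ClusterImpl/ServiceImpl objects are loaded via JPA.
        System.out.println("initializing clusters...");
    }

    public void run() {
        initializeClusters();       // finish initialization first...
        heartbeatMonitor.start();   // ...then start monitoring (the patched order)
    }

    public static void main(String[] args) throws InterruptedException {
        StartupOrderDemo demo = new StartupOrderDemo();
        demo.run();
        Thread.sleep(3000);                 // let the monitor run briefly
        demo.heartbeatMonitor.interrupt();  // then shut it down
    }
}
{code}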

This should not be the final fix for this issue, since the heartbeat monitor and
cluster initialization ought to be able to run concurrently, but the attached
patch provides a workaround that unblocks the issue.

My uneducated guess is that the root problem might be in the locking mechanism
of ClusterImpl.java, but I don't have a deep enough understanding of the
different classes and locks to pinpoint the root cause.
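
If the locking really is the culprit, a proper fix would presumably make both
the initialization path and the heartbeat path serialize on the same
cluster-level lock around the lazily loaded service state, so that EclipseLink
never clones the same entities from two threads at once. A minimal sketch of
that pattern (my assumption, not the actual ClusterImpl code):

{code}
// Hypothetical sketch of cluster-level locking around a lazily loaded service map.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ClusterStateSketch {
    private final ReadWriteLock clusterLock = new ReentrantReadWriteLock();
    private Map<String, Object> services;  // stands in for Map<String, Service>

    /** Called from both the initialization path and the heartbeat path. */
    public Map<String, Object> getServices() {
        clusterLock.readLock().lock();
        try {
            if (services != null) {
                return new HashMap<>(services);
            }
        } finally {
            clusterLock.readLock().unlock();
        }
        // Take the write lock for the one-time lazy load and re-check.
        clusterLock.writeLock().lock();
        try {
            if (services == null) {
                services = loadServicesFromDb();  // all JPA access happens under the write lock
            }
            return new HashMap<>(services);
        } finally {
            clusterLock.writeLock().unlock();
        }
    }

    private Map<String, Object> loadServicesFromDb() {
        return new HashMap<>();  // placeholder for the EclipseLink query
    }
}
{code}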

> Ambari initialization problems after upgrade to 1.4.1
> -----------------------------------------------------
>
>                 Key: AMBARI-4930
>                 URL: https://issues.apache.org/jira/browse/AMBARI-4930
>             Project: Ambari
>          Issue Type: Bug
>    Affects Versions: 1.4.1
>            Reporter: Ximo Guanter
>         Attachments: AMBARI-4930-1.patch
>
>
> Starting the Ambari Server sometimes fails with the following error:
> {code}
> 04:44:56,972  INFO [main] Configuration:511 - Web App DIR test /usr/lib/ambari-server/web
> 04:44:56,975  INFO [main] CertificateManager:70 - Initialization of root certificate
> 04:44:56,975  INFO [main] CertificateManager:72 - Certificate exists:true
> 04:44:57,003  INFO [main] AmbariServer:338 - ********* Initializing Clusters **********
> 04:44:57,285  WARN [Thread-2] HeartbeatMonitor:123 - Heartbeat lost from host andromeda-compute02.hi.inet
> 04:44:57,295  WARN [Thread-2] HeartbeatMonitor:123 - Heartbeat lost from host andromeda-compute03.hi.inet
> 04:44:57,296  WARN [Thread-2] HeartbeatMonitor:123 - Heartbeat lost from host andromeda-compute06.hi.inet
> 04:44:57,296  WARN [Thread-2] HeartbeatMonitor:123 - Heartbeat lost from host andromeda-compute04.hi.inet
> 04:44:57,297  WARN [Thread-2] HeartbeatMonitor:123 - Heartbeat lost from host andromeda-data99.hi.inet
> 04:44:57,318 ERROR [main] AmbariServer:461 - Failed to run the Ambari Server
> Local Exception Stack:
> Exception [EclipseLink-2004] (Eclipse Persistence Services - 2.4.0.v20120608-r11652): org.eclipse.persistence.exceptions.ConcurrencyException
> Exception Description: A signal was attempted before wait() on ConcurrencyManager. This normally means that an attempt was made to commit or rollback a transaction before it was started, or to rollback a transaction twice.
>         at org.eclipse.persistence.exceptions.ConcurrencyException.signalAttemptedBeforeWait(ConcurrencyException.java:84)
>         at org.eclipse.persistence.internal.helper.ConcurrencyManager.releaseReadLock(ConcurrencyManager.java:489)
>         at org.eclipse.persistence.internal.identitymaps.CacheKey.releaseReadLock(CacheKey.java:392)
>         at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.cloneAndRegisterObject(UnitOfWorkImpl.java:1022)
>         at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.cloneAndRegisterObject(UnitOfWorkImpl.java:933)
>         at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getAndCloneCacheKeyFromParent(UnitOfWorkIdentityMapAccessor.java:193)
>         at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getFromIdentityMap(UnitOfWorkIdentityMapAccessor.java:121)
>         at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.registerExistingObject(UnitOfWorkImpl.java:3906)
>         at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.registerExistingObject(UnitOfWorkImpl.java:3861)
>         at org.eclipse.persistence.mappings.CollectionMapping.buildElementUnitOfWorkClone(CollectionMapping.java:296)
>         at org.eclipse.persistence.mappings.CollectionMapping.buildElementClone(CollectionMapping.java:309)
>         at org.eclipse.persistence.internal.queries.ContainerPolicy.addNextValueFromIteratorInto(ContainerPolicy.java:214)
>         at org.eclipse.persistence.mappings.CollectionMapping.buildCloneForPartObject(CollectionMapping.java:222)
>         at org.eclipse.persistence.internal.indirection.UnitOfWorkQueryValueHolder.buildCloneFor(UnitOfWorkQueryValueHolder.java:56)
>         at org.eclipse.persistence.internal.indirection.UnitOfWorkValueHolder.instantiateImpl(UnitOfWorkValueHolder.java:161)
>         at org.eclipse.persistence.internal.indirection.UnitOfWorkValueHolder.instantiate(UnitOfWorkValueHolder.java:222)
>         at org.eclipse.persistence.internal.indirection.DatabaseValueHolder.getValue(DatabaseValueHolder.java:88)
>         at org.eclipse.persistence.indirection.IndirectList.buildDelegate(IndirectList.java:244)
>         at org.eclipse.persistence.indirection.IndirectList.getDelegate(IndirectList.java:415)
>         at org.eclipse.persistence.indirection.IndirectList.isEmpty(IndirectList.java:490)
>         at org.apache.ambari.server.state.ServiceImpl.<init>(ServiceImpl.java:125)
>         at org.apache.ambari.server.state.ServiceImpl$$EnhancerByGuice$$807a405e.<init>(<generated>)
>         at org.apache.ambari.server.state.ServiceImpl$$EnhancerByGuice$$807a405e$$FastClassByGuice$$1c1221ad.newInstance(<generated>)
>         at com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
>         at com.google.inject.internal.ProxyFactory$ProxyConstructor.newInstance(ProxyFactory.java:260)
>         at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:85)
>         at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:254)
>         at com.google.inject.internal.InjectorImpl$4$1.call(InjectorImpl.java:978)
>         at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1024)
>         at com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:974)
>         at com.google.inject.assistedinject.FactoryProvider2.invoke(FactoryProvider2.java:632)
>         at $Proxy12.createExisting(Unknown Source)
>         at org.apache.ambari.server.state.cluster.ClusterImpl.loadServices(ClusterImpl.java:218)
>         at org.apache.ambari.server.state.cluster.ClusterImpl.debugDump(ClusterImpl.java:808)
>         at org.apache.ambari.server.state.cluster.ClustersImpl.debugDump(ClustersImpl.java:566)
>         at org.apache.ambari.server.controller.AmbariServer.run(AmbariServer.java:341)
>         at org.apache.ambari.server.controller.AmbariServer.main(AmbariServer.java:458)
> {code}
> The issue seems to be related to the amount of data in the {{ambarirca}} database: it reproduces 80-90% of the time we start the ambari-server in an environment where that DB is 1GB+, and it basically never reproduces in environments with a small DB.
> Running {{VACUUM FULL}} does not mitigate the problem.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
