[ 
https://issues.apache.org/jira/browse/AMBARI-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Voituret updated AMBARI-24534:
------------------------------------
    Summary: Deadlock issues on query result collection with Ambari Server  
(was: Deadlock)

> Deadlock issues on query result collection with Ambari Server
> -------------------------------------------------------------
>
>                 Key: AMBARI-24534
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24534
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: ambari-server
>         Environment: Cluster with 10 nodes including services from the HDP 
> stack (Hortonworks).
> The Ambari server runs under "Red Hat Enterprise Linux Server release 6.6 
> (Santiago)".
> The database runs under PostgreSQL on the same machine.
>            Reporter: Felix Voituret
>            Priority: Blocker
>
> We are currently facing an issue with Ambari Server which causes performance 
> problems and systematically ends with a JVM crash. Our current production 
> cluster is composed of ten nodes, including most services provided by the 
> Hortonworks Hadoop stack. The performance alerts are related to the Ambari 
> Server REST API.
> We can easily reproduce it by creating activity on the web UI, spamming the 
> interface a little (manually, with one or two users). Logs display timeout 
> errors which, after a certain amount of time, end in a Java OOM. After 
> investigating, here is what we have found so far:
> h2. Database
> We use a PostgreSQL database, which in its current state is still responsive 
> and reactive. We checked some tables such as _alert_history_ (which holds 
> approximately 20k rows) but found nothing suspicious. We checked the 
> _pg_stat_statements_ table and it appears that there is no slow query at the 
> moment (the slowest one we could observe has only a 1 second average runtime, 
> and is not even related to Ambari's tables).
> h2. JVM
> We made 6 thread dumps and one heap dump after generating activity on the UI 
> to make it crash. The following details were observed:
>  * 88 threads are present in the JVM
>  * ~50 threads are in BLOCKED state (waiting for a lock release)
>  * Of the 25 client threads, 22 are also in BLOCKED state (waiting for a lock 
> release)
>  * hprof analysis showed that 3 client threads own 400 MB of heap memory each
>  * 200 MB from a HashMap which holds ResourceImpl instances as keys, and 
> Objects as values
>  * 200 MB from an 
> org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork instance
> I am currently checking the Ambari Server source code through its GitHub 
> repository, matching it against the thread stack traces, using one of the 
> heavy memory consumer threads mentioned earlier as a reference:
>  * The deadlock occurs in the 
> org.apache.ambari.server.api.query.QueryImpl#queryForResources method
>  * While collecting results from a query, 
> org.apache.ambari.server.controller.internal.ResourceImpl instances are 
> inserted into a HashSet
>  * Each insertion triggers a hash code computation on the ResourceImpl 
> instance, and that hash code is computed from the hash code of an internal 
> synchronized hash map
>  * The hash map is the cause of the deadlock: since it is synchronized, it 
> serializes all access when used concurrently, and the hash code computed 
> from such a map uses an iterator which "fails fast on concurrent 
> modification".
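The fail-fast behavior described in the last bullet can be reproduced in isolation. Below is a minimal sketch; the class name and property keys are hypothetical and not taken from the Ambari code base. It shows that mutating a HashMap while an iterator is live (which is what a concurrent writer does while hashCode() iterates the entries) trips the iterator's modCount check:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {

    // Returns true when a structural modification made mid-iteration
    // trips the iterator's fail-fast modCount check, which is what a
    // concurrent writer would cause while HashMap.hashCode() iterates
    // the entry set.
    static boolean iterationFailsFast() {
        Map<String, Object> properties = new HashMap<>();
        properties.put("cluster_name", "c1"); // hypothetical keys
        properties.put("host_name", "node1");
        try {
            for (Map.Entry<String, Object> e : properties.entrySet()) {
                // structural modification while an iterator is live
                properties.put("new_property", "value");
            }
        } catch (ConcurrentModificationException ex) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(iterationFailsFast()); // prints "true"
    }
}
```

In the real deadlock scenario the modification comes from another thread, so the exception (or the lock contention on the synchronized wrapper) is intermittent rather than deterministic as in this single-threaded sketch.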
> This problem is critical as we need to restart the Ambari server quite often, 
> which hurts operational efficiency. I am still looking for the root cause, 
> but I would gladly appreciate some hints about where to look :)
> Even if this issue is driven by our particular context, I think a mitigation 
> should be considered, such as refactoring the hash code computation in 
> ResourceImpl to avoid iterator usage and thereby decrease the deadlock 
> probability.
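One way the suggested refactoring could look, sketched on a hypothetical class (this is not the actual ResourceImpl code): take a private snapshot of the property map under its lock, then compute the hash code from the snapshot only, so no live iterator is ever exposed to concurrent writers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical resource holder, not the real ResourceImpl: it keeps a
// synchronized property map but never iterates the live map in hashCode().
public class SnapshotHashResource {

    // Collections.synchronizedMap uses the wrapper object itself as the
    // mutex, so synchronized (properties) below takes the same lock as
    // put()/get() calls on the wrapper.
    private final Map<String, Object> properties =
            Collections.synchronizedMap(new HashMap<>());

    public void setProperty(String key, Object value) {
        properties.put(key, value);
    }

    @Override
    public int hashCode() {
        // Copy under the map's own lock, then iterate only the private
        // copy; concurrent writers can no longer trigger a fail-fast
        // ConcurrentModificationException during hash computation.
        // (An equals() override would need the same snapshot care.)
        Map<String, Object> snapshot;
        synchronized (properties) {
            snapshot = new HashMap<>(properties);
        }
        return snapshot.hashCode();
    }

    public static void main(String[] args) {
        SnapshotHashResource a = new SnapshotHashResource();
        SnapshotHashResource b = new SnapshotHashResource();
        a.setProperty("cluster_name", "c1");
        b.setProperty("cluster_name", "c1");
        System.out.println(a.hashCode() == b.hashCode()); // prints "true"
    }
}
```

The snapshot shortens the critical section to a single copy, at the cost of a transient allocation per hashCode() call; caching the hash and invalidating it on writes would be a further refinement.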



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
