[
https://issues.apache.org/jira/browse/AMBARI-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Felix Voituret updated AMBARI-24534:
------------------------------------
Summary: Deadlock issues on query result collection with Ambari Server
(was: Deadlock)
> Deadlock issues on query result collection with Ambari Server
> -------------------------------------------------------------
>
> Key: AMBARI-24534
> URL: https://issues.apache.org/jira/browse/AMBARI-24534
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: ambari-server
> Environment: Cluster with 10 nodes including services from HDP stack
> (Hortonworks).
> Ambari Server runs under "Red Hat Enterprise Linux Server release 6.6
> (Santiago)".
> The database runs PostgreSQL on the same machine.
> Reporter: Felix Voituret
> Priority: Blocker
>
> We are currently facing an issue with Ambari Server which causes performance
> problems and systematically ends with a JVM crash. Our current production
> cluster is composed of ten nodes, including most services provided by the
> Hortonworks Hadoop stack. The performance alerts are related to the Ambari
> Server REST API.
> We can easily reproduce it just by creating activity on the web UI, spamming
> the interface a little (manually, with one or two users). The logs display
> timeout errors which, after a certain amount of time, end with a Java OOM.
> After investigating, here is what we found so far:
> h2. Database
> We use a PostgreSQL database, which in its current state is still responsive
> and reactive. We checked some tables such as _alert_history_ (which holds
> approximately 20k rows) but found nothing suspicious. We also checked the
> _pg_stat_statements_ view, and it appears that there is no slow query at the
> moment (the slowest we could observe has only a 1 second average runtime,
> and is not even related to Ambari's tables).
> h2. JVM
> We made 6 thread dumps and one heap dump after generating activity on the UI
> to make it crash. The following details were observed:
> * 88 threads are present in the JVM
> * ~50 threads are in BLOCKED state (waiting for a lock release)
> * Of 25 client threads, 22 are also in BLOCKED state (waiting for a lock
> release)
> * hprof analysis showed that 3 client threads each own 400 MB of heap
> memory:
> ** 200 MB from a HashMap which holds ResourceImpl instances as keys and
> Object instances as values
> ** 200 MB from an
> org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork instance
> I am currently checking the Ambari Server source code through its GitHub
> repository, matching it with the thread stack traces, using one of the heavy
> memory consumer threads mentioned earlier as a reference:
> * The deadlock occurs in the
> org.apache.ambari.server.api.query.QueryImpl#queryForResources method
> * While collecting results from the query,
> org.apache.ambari.server.controller.internal.ResourceImpl instances are
> inserted into a HashSet
> * Insertion triggers a hashCode computation on the ResourceImpl instance,
> and that hash code is computed from the hash code of an internal
> synchronized hash map
> * The hash map is the cause of the deadlock: since it is synchronized, all
> access to it is serialized under a single lock when used concurrently, and
> the hash code computation iterates over the map with an iterator which
> "fails fast on concurrent modification"
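> The hazard can be sketched in isolation. In the minimal stand-alone example
> below, ResourceLike is a hypothetical simplification of ResourceImpl (not
> the actual Ambari class): its hashCode delegates to a synchronized map, so
> every HashSet operation iterates the map under the map's single lock, and a
> later property mutation silently changes the hash:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simplification of ResourceImpl: hashCode is derived from a
// mutable synchronized map, as described in the analysis above.
class ResourceLike {
    private final Map<String, Object> properties =
            Collections.synchronizedMap(new HashMap<>());

    void setProperty(String key, Object value) {
        properties.put(key, value);
    }

    @Override
    public int hashCode() {
        // Collections.synchronizedMap holds its mutex while delegating to the
        // backing map's hashCode, which iterates every entry. Concurrent
        // callers serialize on this lock; unsynchronized iterators fail fast.
        return properties.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ResourceLike
                && properties.equals(((ResourceLike) o).properties);
    }
}

public class MutableHashDemo {
    public static void main(String[] args) {
        ResourceLike resource = new ResourceLike();
        resource.setProperty("state", 1);

        Set<ResourceLike> results = new HashSet<>();
        results.add(resource);            // hash computed at insertion time

        resource.setProperty("state", 2); // mutation changes the hash

        // The element is still in the set, but lookup now probes the wrong
        // bucket, so the object can no longer be found:
        System.out.println(results.contains(resource)); // prints "false"
        System.out.println(results.size());             // prints "1"
    }
}
```

> Beyond the lock contention, this shows a second problem with hashing a
> mutable map: once a property changes, the element effectively becomes
> unreachable in the HashSet it was inserted into.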
> This problem is critical, as we need to restart Ambari Server quite often,
> which hurts operational efficiency. I am still looking for the root cause,
> but I would gladly appreciate some hints about where to look :)
> I think a mitigation should be considered even if this issue is actually
> driven by context, for example refactoring the hashCode computation in
> ResourceImpl to avoid iterator usage and thus decrease the deadlock
> probability.
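> As a concrete illustration of that suggestion, here is a minimal sketch
> (hypothetical code, not a patch against the actual ResourceImpl): the hash
> is based on an immutable field, so HashSet insertion and lookup never
> iterate the synchronized map and never contend on its lock:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical refactoring sketch: identity for hashing comes from an
// immutable field (here, a resource type string) instead of the mutable
// synchronized property map.
class StableResource {
    private final String type;                       // immutable identity
    private final Map<String, Object> properties =
            Collections.synchronizedMap(new HashMap<>());

    StableResource(String type) {
        this.type = type;
    }

    void setProperty(String key, Object value) {
        properties.put(key, value);
    }

    @Override
    public int hashCode() {
        return type.hashCode();                      // no map iteration, no lock
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StableResource)) return false;
        StableResource other = (StableResource) o;
        // equals may still read the map, but the hashCode/equals contract
        // holds: equal objects share the same (type-based) hash.
        return type.equals(other.type) && properties.equals(other.properties);
    }
}
```

> With this shape, concurrent property updates can no longer invalidate an
> element's bucket placement, and hash computation stops competing for the
> map's lock during result collection.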
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)