Felix Voituret created AMBARI-24534:
---------------------------------------
Summary: Deadlock
Key: AMBARI-24534
URL: https://issues.apache.org/jira/browse/AMBARI-24534
Project: Ambari
Issue Type: Bug
Components: ambari-server
Affects Versions: ambari-server
Environment: Cluster with 10 nodes including services from the HDP stack
(Hortonworks).
The Ambari server runs under "Red Hat Enterprise Linux Server release 6.6
(Santiago)".
The database runs under PostgreSQL on the same machine.
Reporter: Felix Voituret
We are currently facing an issue with Ambari Server which causes performance
issues and systematically ends with a JVM crash. Our current production
cluster is composed of ten nodes, including most services provided by the
Hortonworks Hadoop stack. Performance alerts are related to the Ambari Server
REST API.
We can easily reproduce it just by creating activity on the web UI, spamming
the interface a little (manually, with one or two users). Logs display timeout
errors which, after a certain amount of time, end up with a Java OOM. After
investigating, here is what we have found so far:
h2. Database
We use a PostgreSQL database, which in its current state is still responsive
and reactive. We checked some tables such as _alert_history_ (which holds
approximately 20k rows) but found nothing suspicious. We checked the
_pg_stat_statements_ view and it appears that there is no slow query at the
moment (the slowest we observed has only a one-second average runtime, and it
is not even related to an Ambari table).
h2. JVM
We made 6 thread dumps and one heap dump after generating activity on the UI
to make it crash. The following details were observed:
* 88 threads are present in the JVM
* ~50 threads are in BLOCKED state (waiting for a lock release)
* Of 25 client threads, 22 are also in BLOCKED state (waiting for a lock
release)
* hprof analysis showed that 3 client threads each own 400 MB of heap memory:
** 200 MB from a HashMap which holds ResourceImpl instances as keys and Object
as values
** 200 MB from an
org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork instance
I am currently checking the Ambari Server source code through its GitHub
repository, matching it against the thread stack traces, using one of the
heavy memory consumer threads mentioned earlier as a reference:
* The deadlock occurs in the
org.apache.ambari.server.api.query.QueryImpl#queryForResources method
* While collecting results from the query,
org.apache.ambari.server.controller.internal.ResourceImpl instances are
inserted into a HashSet
* Each insertion triggers a hash code computation on the ResourceImpl
instance, and that hash code is computed from the hash code of an internal
synchronized hash map
* The synchronized hash map is the cause of the deadlock: it blocks all other
access while the lock is held, and the hash code computed from such a map uses
an iterator which "fails fast" on concurrent modification
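For reference, the two behaviors described above can be reproduced in isolation: a Map's hash code is defined as the sum of its entries' hash codes, so computing it iterates the whole map (and on a Collections.synchronizedMap wrapper it holds the mutex for the entire iteration), while an unsynchronized iteration over a HashMap fails fast on a concurrent structural modification. A minimal, self-contained Java sketch (the class and method names are invented for illustration, this is not Ambari code):

```java
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class SyncMapHashDemo {

    /** Map#hashCode is specified as the sum of its entries' hash codes,
     *  so computing it must iterate every entry of the map. */
    static int entrySum(Map<String, Integer> m) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : m.entrySet()) {
            sum += e.hashCode();
        }
        return sum;
    }

    /** Simulates a put racing with an unsynchronized iteration:
     *  HashMap's fail-fast iterator throws on the next access. */
    static boolean failsFastOnConcurrentPut() {
        Map<String, Integer> backing = new HashMap<>();
        for (int i = 0; i < 10; i++) {
            backing.put("k" + i, i);
        }
        try {
            for (String key : backing.keySet()) {
                backing.put("extra", 0); // structural modification mid-iteration
            }
            return false; // never reached
        } catch (ConcurrentModificationException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> backing = new HashMap<>();
        backing.put("a", 1);
        backing.put("b", 2);
        Map<String, Integer> synced = Collections.synchronizedMap(backing);

        // hashCode() iterates every entry; on the synchronized wrapper it
        // holds the mutex for the whole duration, blocking other threads.
        System.out.println(synced.hashCode() == entrySum(backing)); // true
        System.out.println(failsFastOnConcurrentPut());             // true
    }
}
```

With many concurrent REST requests, each HashSet insertion repeats this full iteration under the same lock, which would explain the ~50 BLOCKED threads in the dumps.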
This problem is critical, as we need to restart the Ambari server quite often,
which hurts operational efficiency. I am still looking for the root cause, but
I would gladly appreciate some hints about where to look :)
I think a mechanism should be considered to avoid this even if the issue is
actually driven by context, such as refactoring the hash code computation in
ResourceImpl to avoid iterator usage and thereby decrease the deadlock
probability.
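As an illustration of that idea, here is a hypothetical sketch (not Ambari's actual ResourceImpl) that caches the map-based hash code and recomputes it only after a mutation, so repeated HashSet operations do not re-iterate the synchronized map each time:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not Ambari code: the expensive map-based hash
 *  code is computed lazily and cached, and only invalidated when the
 *  properties actually change (equals() omitted for brevity). */
public class CachedHashResource {

    private final Map<String, Object> properties =
            Collections.synchronizedMap(new HashMap<>());

    // null means "needs recompute"; volatile so readers see invalidation.
    private volatile Integer cachedHash;

    public void setProperty(String key, Object value) {
        properties.put(key, value);
        cachedHash = null; // invalidate the cache on mutation
    }

    @Override
    public int hashCode() {
        Integer h = cachedHash;
        if (h == null) {
            // One short critical section per mutation, instead of a full
            // map iteration under the lock on every HashSet insertion.
            synchronized (properties) {
                h = properties.hashCode();
            }
            cachedHash = h;
        }
        return h;
    }
}
```

Note that mutating an object after it has been inserted into a HashSet is fragile regardless of caching; a deeper fix would make the part of ResourceImpl that defines its identity immutable.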
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)