-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50079/
-----------------------------------------------------------

(Updated July 15, 2016, 4:06 p.m.)


Review request for Ambari, Alejandro Fernandez, Nate Cole, and Sid Wagle.


Changes
-------

Added test coverage.


Bugs: AMBARI-17738
    https://issues.apache.org/jira/browse/AMBARI-17738


Repository: ambari


Description
-------

Reproduced as part of creating a rolling upgrade on a large cluster.

Although it initially appears to be a deadlock, the hang is caused by Postgres 
holding the socket open indefinitely. We hold a write lock while the socket is 
open. Jstack dumps taken many minutes apart show the same thread stuck in a 
socket read. Investigating on Postgres shows that a lock is blocking the 
waiting thread.
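
For anyone trying to confirm the same symptom, the jstack comparison above can 
be approximated from inside the JVM. The following is only a rough sketch and 
not part of this patch: the class name SocketReadFinder is made up for 
illustration, and the frame it looks for (java.net.SocketInputStream.socketRead0) 
is an assumption that holds for Java 8 style stack traces; newer JDKs may show 
different frames.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Ambari: lists threads blocked in a socket read.
final class SocketReadFinder {

  // Assumption: a blocked read shows up as java.net.SocketInputStream.socketRead0,
  // which is what Java 8 stack traces contain. Other JDKs may use other frames.
  static boolean isInSocketRead(StackTraceElement[] stack) {
    for (StackTraceElement frame : stack) {
      if ("java.net.SocketInputStream".equals(frame.getClassName())
          && "socketRead0".equals(frame.getMethodName())) {
        return true;
      }
    }
    return false;
  }

  // Takes an in-process thread dump and returns the names of matching threads.
  static List<String> threadsInSocketRead() {
    List<String> names = new ArrayList<>();
    ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
    for (ThreadInfo info : dump) {
      if (isInSocketRead(info.getStackTrace())) {
        names.add(info.getThreadName());
      }
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(threadsInSocketRead());
  }
}
```

If the same thread name is reported by two runs taken minutes apart, that is 
consistent with the stuck socket read described above.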

The sequence query is currently stuck in the {{idle in transaction}} state 
which is why it's blocking the other query. The transaction isn't being ended 
by EclipseLink.
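
One way to see such sessions is to query pg_stat_activity. The sketch below is 
illustrative only and not part of this patch: the class name 
IdleInTransactionCheck is hypothetical, the JDBC URL and credentials are 
supplied by the caller, the Postgres JDBC driver is assumed to be on the 
classpath, and the column names (pid, state, xact_start, query) assume 
Postgres 9.2 or later.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Ambari: lists sessions that have an open
// transaction but are not currently running a statement.
final class IdleInTransactionCheck {

  // Assumption: Postgres 9.2+ column names. Oldest open transactions come first.
  static final String QUERY =
      "SELECT pid, xact_start, query FROM pg_stat_activity "
          + "WHERE state = 'idle in transaction' ORDER BY xact_start";

  static List<String> findIdleInTransaction(Connection connection) throws SQLException {
    List<String> sessions = new ArrayList<>();
    try (PreparedStatement statement = connection.prepareStatement(QUERY);
         ResultSet rows = statement.executeQuery()) {
      while (rows.next()) {
        // For an idle session, "query" holds the last statement it executed.
        sessions.add(rows.getInt("pid") + " since " + rows.getTimestamp("xact_start")
            + ": " + rows.getString("query"));
      }
    }
    return sessions;
  }

  public static void main(String[] args) throws SQLException {
    if (args.length != 3) {
      System.out.println("usage: IdleInTransactionCheck <jdbc-url> <user> <password>");
      return;
    }
    try (Connection connection = DriverManager.getConnection(args[0], args[1], args[2])) {
      for (String session : findIdleInTransaction(connection)) {
        System.out.println(session);
      }
    }
  }
}
```

A session that stays in this list across several runs, with an old xact_start, 
would match the behavior described above.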

The cause is that we begin a transaction and then hammer the database for 2-3 
minutes. During that time, Postgres must keep track of all kinds of 
hostcomponentstate updates isolated from our current transaction. When we go to 
commit the upgrade, Postgres eventually ends up in a deadlock where it doesn't 
think that the transaction ended.


Diffs (updated)
-----

  
ambari-server/src/main/java/org/apache/ambari/server/controller/internal/UpgradeResourceProvider.java
 2e976ba 
  
ambari-server/src/main/java/org/apache/ambari/server/orm/entities/UpgradeEntity.java
 db27ea5 
  
ambari-server/src/main/java/org/apache/ambari/server/orm/entities/UpgradeGroupEntity.java
 96f96d5 
  
ambari-server/src/main/java/org/apache/ambari/server/orm/entities/UpgradeItemEntity.java
 6e4a889 
  
ambari-server/src/test/java/org/apache/ambari/server/controller/internal/UpgradeResourceProviderTest.java
 a5db0f0 

Diff: https://reviews.apache.org/r/50079/diff/


Testing (updated)
-------

Fixed on a live cluster where it was 100% reproducible.


Thanks,

Jonathan Hurley
