Jonathan Hurley created AMBARI-12526:
----------------------------------------

             Summary: Ambari Cluster Deployment Stuck At 2% With A SQL Deadlock 
When Talking to SQL Azure
                 Key: AMBARI-12526
                 URL: https://issues.apache.org/jira/browse/AMBARI-12526
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 2.1.0
            Reporter: Jonathan Hurley
            Assignee: Jonathan Hurley
            Priority: Blocker
             Fix For: 2.1.1


When deploying a new cluster on SQL Azure, there is a recurring deadlock on the 
SQL Server:

{code}
15 Jul 2015 22:13:31,453 ERROR [ambari-action-scheduler] 
AmbariJpaLocalTxnInterceptor:114 - [DETAILED ERROR] Rollback reason: 
Local Exception Stack: 
Exception [EclipseLink-4002] (Eclipse Persistence Services - 
2.5.2.v20140319-9ad6abd): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.microsoft.sqlserver.jdbc.SQLServerException: 
Transaction (Process ID 62) was deadlocked on lock resources with another 
process and has been chosen as the deadlock victim. Rerun the transaction.
Error Code: 1205
Call: UPDATE hostcomponentstate SET current_state = ? WHERE ((((component_name 
= ?) AND (host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
        bind => [5 parameters bound]
        at 
org.eclipse.persistence.exceptions.DatabaseException.sqlException(DatabaseException.java:331)
        at 
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeDirectNoSelect(DatabaseAccessor.java:900)
        at 
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeNoSelect(DatabaseAccessor.java:962)
        at 
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.basicExecuteCall(DatabaseAccessor.java:631)
        at 
org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatch(ParameterizedSQLBatchWritingMechanism.java:149)
        at 
org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatchedStatements(ParameterizedSQLBatchWritingMechanism.java:134)
        at 
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.writesCompleted(DatabaseAccessor.java:1836)
        at 
org.eclipse.persistence.internal.sessions.AbstractSession.writesCompleted(AbstractSession.java:4244)
        at 
org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.writesCompleted(UnitOfWorkImpl.java:5594)
        at 
org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork.writeChanges(RepeatableWriteUnitOfWork.java:453)
        at 
org.eclipse.persistence.internal.jpa.EntityManagerImpl.flush(EntityManagerImpl.java:863)
        at 
org.eclipse.persistence.internal.jpa.QueryImpl.performPreQueryFlush(QueryImpl.java:963)
        at 
org.eclipse.persistence.internal.jpa.QueryImpl.executeReadQuery(QueryImpl.java:207)
        at 
org.eclipse.persistence.internal.jpa.QueryImpl.getSingleResult(QueryImpl.java:517)
        at 
org.eclipse.persistence.internal.jpa.EJBQueryImpl.getSingleResult(EJBQueryImpl.java:400)
        at org.apache.ambari.server.orm.dao.DaoUtils.selectOne(DaoUtils.java:80)
        at org.apache.ambari.server.orm.dao.StackDAO.find(StackDAO.java:93)
        at 
org.apache.ambari.server.orm.AmbariLocalSessionInterceptor.invoke(AmbariLocalSessionInterceptor.java:53)
        at 
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.setStackVersion(ServiceComponentHostImpl.java:1058)
        at 
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:628)
        at 
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:610)
        at 
org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:354)
        at 
org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:294)
        at 
org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
        at 
org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:440)
        at 
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.handleEvent(ServiceComponentHostImpl.java:901)
        at 
org.apache.ambari.server.state.cluster.ClusterImpl.processServiceComponentHostEvents(ClusterImpl.java:2508)
        at 
org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:68)
        at 
org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:343)
        at 
org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:195)
        at java.lang.Thread.run(Thread.java:745)
{code}

It's just a JPA entity merge that's doing this. The whole transaction is:

{code}
UPDATE hostcomponentstate SET version = ? WHERE ((((component_name = ?) AND 
(host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
{code}

There is a CLUSTERED index on {{component_name}}, {{host_id}}, {{cluster_id}}, 
and {{service_name}}, so the predicate of the query caused the Clustered Index 
Seek (that's good since that causes a single lock). The {{UPDATE}} however then 
causes the Clustered Index Update (which is a {{DELETE}} followed by an 
{{INSERT}}, no?)

Essentially, we have concurrent {{UPDATE}} statements in separate transactions 
acting on different rows of {{hostcomponentstate}}. This seems to cause a 
deadlock because both processes have an X lock and then try to acquire a U 
lock. The U lock is what is making me think they are trying to acquire the 
table lock in order to update the cluster index.

In any event, it seems like the deadlock is caused by the escalation of a lock 
(whether it's a page or table); both processes hold row level key locks which 
are exclusive and then they both request a U lock which blocks for both.

Below is the deadlock graph XML:
{code}
<?xml version="1.0" encoding="UTF-8"?>
<deadlock-list>
  <deadlock victim="process10faf6ca8">
    <process-list>
      <process id="process10faf6ca8" taskpriority="0" logused="344" 
waitresource="KEY: 5:72057594165723136 (fd5c95c6a91a)" waittime="4391" 
ownerId="16290998" transactionname="implicit_transaction" 
lasttranstarted="2015-07-22T00:18:01.547" XDES="0x1177363b0" lockMode="U" 
schedulerid="2" kpid="3324" status="suspended" spid="54" sbid="0" ecid="0" 
priority="0" trancount="2" lastbatchstarted="2015-07-22T00:18:01.547" 
lastbatchcompleted="2015-07-22T00:18:01.547" 
lastattention="1900-01-01T00:00:00.547" clientapp="Microsoft JDBC Driver for 
SQL Server" hostname="headnode0" hostpid="0" 
loginname="[email protected]" isolationlevel="read committed 
(2)" xactid="16290998" currentdb="5" lockTimeout="4294967295" 
clientoption1="671088672" clientoption2="128058">
        <executionStack>
          <frame procname="adhoc" line="1" stmtstart="160" stmtend="450" 
sqlhandle="0x02000000b5fa61048c97ed65c734e84590264b91bbf93aca0000000000000000000000000000000000000000">unknown</frame>
          <frame procname="unknown" line="1" 
sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">unknown</frame>
        </executionStack>
        <inputbuf>(@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 bigint,@P3 
bigint,@P4 nvarchar(4000))UPDATE hostcomponentstate SET version = @P0 WHERE 
((((component_name = @P1) AND (host_id = @P2)) AND (cluster_id = @P3)) AND 
(service_name = @P4))</inputbuf>
      </process>
      <process id="process10dfa1c28" taskpriority="0" logused="520" 
waitresource="KEY: 5:72057594165723136 (930b38dc45e3)" waittime="4384" 
ownerId="16290959" transactionname="implicit_transaction" 
lasttranstarted="2015-07-22T00:18:01.537" XDES="0x1199b43b0" lockMode="U" 
schedulerid="1" kpid="3364" status="suspended" spid="61" sbid="0" ecid="0" 
priority="0" trancount="2" lastbatchstarted="2015-07-22T00:18:01.553" 
lastbatchcompleted="2015-07-22T00:18:01.553" 
lastattention="1900-01-01T00:00:00.553" clientapp="Microsoft JDBC Driver for 
SQL Server" hostname="headnode0" hostpid="0" 
loginname="[email protected]" isolationlevel="read committed 
(2)" xactid="16290959" currentdb="5" lockTimeout="4294967295" 
clientoption1="671088672" clientoption2="128058">
        <executionStack>
          <frame procname="adhoc" line="1" stmtstart="160" stmtend="462" 
sqlhandle="0x02000000e94be81b7f4521ee46bc4614740d406162aad19d0000000000000000000000000000000000000000">unknown</frame>
          <frame procname="unknown" line="1" 
sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">unknown</frame>
        </executionStack>
        <inputbuf>(@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 bigint,@P3 
bigint,@P4 nvarchar(4000))UPDATE hostcomponentstate SET current_state = @P0 
WHERE ((((component_name = @P1) AND (host_id = @P2)) AND (cluster_id = @P3)) 
AND (service_name = @P4))</inputbuf>
      </process>
    </process-list>
    <resource-list>
      <keylock hobtid="72057594165723136" dbid="5" 
objectname="ambari.dbo.hostcomponentstate" 
indexname="PK__hostcomp__C72E1492ED078925" id="lock113b1de80" mode="X" 
associatedObjectId="72057594165723136">
        <owner-list>
          <owner id="process10dfa1c28" mode="X" />
        </owner-list>
        <waiter-list>
          <waiter id="process10faf6ca8" mode="U" requestType="wait" />
        </waiter-list>
      </keylock>
      <keylock hobtid="72057594165723136" dbid="5" 
objectname="ambari.dbo.hostcomponentstate" 
indexname="PK__hostcomp__C72E1492ED078925" id="lock110cf0b00" mode="X" 
associatedObjectId="72057594165723136">
        <owner-list>
          <owner id="process10faf6ca8" mode="X" />
        </owner-list>
        <waiter-list>
          <waiter id="process10dfa1c28" mode="U" requestType="wait" />
        </waiter-list>
      </keylock>
    </resource-list>
  </deadlock>
</deadlock-list>
{code}

Both processes have an X lock on clustered index 
{{PK__hostcomp__C72E1492ED078925}} and then both try to obtain a U lock; this 
is because they are doing a Clustered Index Update and are requesting an 
escalation to a table lock. This is where the deadlock occurs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to