Jonathan Hurley created AMBARI-12526:
----------------------------------------
Summary: Ambari Cluster Deployment Stuck At 2% With A SQL Deadlock
When Talking to SQL Azure
Key: AMBARI-12526
URL: https://issues.apache.org/jira/browse/AMBARI-12526
Project: Ambari
Issue Type: Bug
Components: ambari-server
Affects Versions: 2.1.0
Reporter: Jonathan Hurley
Assignee: Jonathan Hurley
Priority: Blocker
Fix For: 2.1.1
When deploying a new cluster on SQL Azure, there is a recurring deadlock on the
SQL Server:
{code}
15 Jul 2015 22:13:31,453 ERROR [ambari-action-scheduler]
AmbariJpaLocalTxnInterceptor:114 - [DETAILED ERROR] Rollback reason:
Local Exception Stack:
Exception [EclipseLink-4002] (Eclipse Persistence Services -
2.5.2.v20140319-9ad6abd): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.microsoft.sqlserver.jdbc.SQLServerException:
Transaction (Process ID 62) was deadlocked on lock resources with another
process and has been chosen as the deadlock victim. Rerun the transaction.
Error Code: 1205
Call: UPDATE hostcomponentstate SET current_state = ? WHERE ((((component_name
= ?) AND (host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
bind => [5 parameters bound]
at
org.eclipse.persistence.exceptions.DatabaseException.sqlException(DatabaseException.java:331)
at
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeDirectNoSelect(DatabaseAccessor.java:900)
at
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeNoSelect(DatabaseAccessor.java:962)
at
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.basicExecuteCall(DatabaseAccessor.java:631)
at
org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatch(ParameterizedSQLBatchWritingMechanism.java:149)
at
org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatchedStatements(ParameterizedSQLBatchWritingMechanism.java:134)
at
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.writesCompleted(DatabaseAccessor.java:1836)
at
org.eclipse.persistence.internal.sessions.AbstractSession.writesCompleted(AbstractSession.java:4244)
at
org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.writesCompleted(UnitOfWorkImpl.java:5594)
at
org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork.writeChanges(RepeatableWriteUnitOfWork.java:453)
at
org.eclipse.persistence.internal.jpa.EntityManagerImpl.flush(EntityManagerImpl.java:863)
at
org.eclipse.persistence.internal.jpa.QueryImpl.performPreQueryFlush(QueryImpl.java:963)
at
org.eclipse.persistence.internal.jpa.QueryImpl.executeReadQuery(QueryImpl.java:207)
at
org.eclipse.persistence.internal.jpa.QueryImpl.getSingleResult(QueryImpl.java:517)
at
org.eclipse.persistence.internal.jpa.EJBQueryImpl.getSingleResult(EJBQueryImpl.java:400)
at org.apache.ambari.server.orm.dao.DaoUtils.selectOne(DaoUtils.java:80)
at org.apache.ambari.server.orm.dao.StackDAO.find(StackDAO.java:93)
at
org.apache.ambari.server.orm.AmbariLocalSessionInterceptor.invoke(AmbariLocalSessionInterceptor.java:53)
at
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.setStackVersion(ServiceComponentHostImpl.java:1058)
at
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:628)
at
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:610)
at
org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:354)
at
org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:294)
at
org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
at
org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:440)
at
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.handleEvent(ServiceComponentHostImpl.java:901)
at
org.apache.ambari.server.state.cluster.ClusterImpl.processServiceComponentHostEvents(ClusterImpl.java:2508)
at
org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:68)
at
org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:343)
at
org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:195)
at java.lang.Thread.run(Thread.java:745)
{code}
It's just a JPA entity merge that's doing this. The whole transaction is:
{code}
UPDATE hostcomponentstate SET version = ? WHERE ((((component_name = ?) AND
(host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
{code}
There is a CLUSTERED index on {{component_name}}, {{host_id}}, {{cluster_id}},
and {{service_name}}, so the predicate of the query caused the Clustered Index
Seek (that's good since that causes a single lock). The {{UPDATE}} however then
causes the Clustered Index Update (which is a {{DELETE}} followed by an
{{INSERT}}, no?)
Essentially, we have concurrent {{UPDATE}} statements in separate transactions
acting on different rows of {{hostcomponentstate}}. This seems to cause a
deadlock because both processes have an X lock and then try to acquire a U
lock. The U lock is what is making me think they are trying to acquire the
table lock in order to update the cluster index.
In any event, it seems like the deadlock is caused by the escalation of a lock
(whether it's a page or table); both processes hold row level key locks which
are exclusive and then they both request a U lock which blocks for both.
Below is the deadlock graph XML:
{code}
<?xml version="1.0" encoding="UTF-8"?>
<deadlock-list>
<deadlock victim="process10faf6ca8">
<process-list>
<process id="process10faf6ca8" taskpriority="0" logused="344"
waitresource="KEY: 5:72057594165723136 (fd5c95c6a91a)" waittime="4391"
ownerId="16290998" transactionname="implicit_transaction"
lasttranstarted="2015-07-22T00:18:01.547" XDES="0x1177363b0" lockMode="U"
schedulerid="2" kpid="3324" status="suspended" spid="54" sbid="0" ecid="0"
priority="0" trancount="2" lastbatchstarted="2015-07-22T00:18:01.547"
lastbatchcompleted="2015-07-22T00:18:01.547"
lastattention="1900-01-01T00:00:00.547" clientapp="Microsoft JDBC Driver for
SQL Server" hostname="headnode0" hostpid="0"
loginname="[email protected]" isolationlevel="read committed
(2)" xactid="16290998" currentdb="5" lockTimeout="4294967295"
clientoption1="671088672" clientoption2="128058">
<executionStack>
<frame procname="adhoc" line="1" stmtstart="160" stmtend="450"
sqlhandle="0x02000000b5fa61048c97ed65c734e84590264b91bbf93aca0000000000000000000000000000000000000000">unknown</frame>
<frame procname="unknown" line="1"
sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">unknown</frame>
</executionStack>
<inputbuf>(@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 bigint,@P3
bigint,@P4 nvarchar(4000))UPDATE hostcomponentstate SET version = @P0 WHERE
((((component_name = @P1) AND (host_id = @P2)) AND (cluster_id = @P3)) AND
(service_name = @P4))</inputbuf>
</process>
<process id="process10dfa1c28" taskpriority="0" logused="520"
waitresource="KEY: 5:72057594165723136 (930b38dc45e3)" waittime="4384"
ownerId="16290959" transactionname="implicit_transaction"
lasttranstarted="2015-07-22T00:18:01.537" XDES="0x1199b43b0" lockMode="U"
schedulerid="1" kpid="3364" status="suspended" spid="61" sbid="0" ecid="0"
priority="0" trancount="2" lastbatchstarted="2015-07-22T00:18:01.553"
lastbatchcompleted="2015-07-22T00:18:01.553"
lastattention="1900-01-01T00:00:00.553" clientapp="Microsoft JDBC Driver for
SQL Server" hostname="headnode0" hostpid="0"
loginname="[email protected]" isolationlevel="read committed
(2)" xactid="16290959" currentdb="5" lockTimeout="4294967295"
clientoption1="671088672" clientoption2="128058">
<executionStack>
<frame procname="adhoc" line="1" stmtstart="160" stmtend="462"
sqlhandle="0x02000000e94be81b7f4521ee46bc4614740d406162aad19d0000000000000000000000000000000000000000">unknown</frame>
<frame procname="unknown" line="1"
sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000">unknown</frame>
</executionStack>
<inputbuf>(@P0 nvarchar(4000),@P1 nvarchar(4000),@P2 bigint,@P3
bigint,@P4 nvarchar(4000))UPDATE hostcomponentstate SET current_state = @P0
WHERE ((((component_name = @P1) AND (host_id = @P2)) AND (cluster_id = @P3))
AND (service_name = @P4))</inputbuf>
</process>
</process-list>
<resource-list>
<keylock hobtid="72057594165723136" dbid="5"
objectname="ambari.dbo.hostcomponentstate"
indexname="PK__hostcomp__C72E1492ED078925" id="lock113b1de80" mode="X"
associatedObjectId="72057594165723136">
<owner-list>
<owner id="process10dfa1c28" mode="X" />
</owner-list>
<waiter-list>
<waiter id="process10faf6ca8" mode="U" requestType="wait" />
</waiter-list>
</keylock>
<keylock hobtid="72057594165723136" dbid="5"
objectname="ambari.dbo.hostcomponentstate"
indexname="PK__hostcomp__C72E1492ED078925" id="lock110cf0b00" mode="X"
associatedObjectId="72057594165723136">
<owner-list>
<owner id="process10faf6ca8" mode="X" />
</owner-list>
<waiter-list>
<waiter id="process10dfa1c28" mode="U" requestType="wait" />
</waiter-list>
</keylock>
</resource-list>
</deadlock>
</deadlock-list>
{code}
Both processes have an X lock on clustered index
{{PK__hostcomp__C72E1492ED078925}} and then both try to obtain a U lock; this
is because they are doing a Clustered Index Update and are requesting an
escalation to a table lock. This is where the deadlock occurs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)