Jonathan Hurley created AMBARI-12570:
----------------------------------------

             Summary: Cluster creates stuck at 9x% (deadlock sql exception)
                 Key: AMBARI-12570
                 URL: https://issues.apache.org/jira/browse/AMBARI-12570
             Project: Ambari
          Issue Type: Bug
    Affects Versions: 2.1.0
            Reporter: Jonathan Hurley
            Assignee: Jonathan Hurley
            Priority: Blocker
             Fix For: 2.1.1


Similar to AMBARI-12526, Ambari installation via a blueprint on SQL Azure gets 
stuck somewhere between 90% and 100% because of a SQL Database deadlock. 

This is always between {{hostcomponentstate.current_state}} and 
{{hostcomponentstate.version}}. 

{code}
Rollback reason: 
Local Exception Stack: 
Exception [EclipseLink-4002] (Eclipse Persistence Services - 
2.5.2.v20140319-9ad6abd): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.microsoft.sqlserver.jdbc.SQLServerException: 
Transaction (Process ID 62) was deadlocked on lock resources with another 
process and has been chosen as the deadlock victim. Rerun the transaction.
Error Code: 1205
Call: UPDATE hostcomponentstate SET current_state = ? WHERE ((((component_name 
= ?) AND (host_id = ?)) AND (cluster_id = ?)) AND (service_name = ?))
        bind => [5 parameters bound]
        at 
org.eclipse.persistence.exceptions.DatabaseException.sqlException(DatabaseException.java:331)
        at 
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeDirectNoSelect(DatabaseAccessor.java:900)
        at 
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeNoSelect(DatabaseAccessor.java:962)
        at 
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.basicExecuteCall(DatabaseAccessor.java:631)
        at 
org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatch(ParameterizedSQLBatchWritingMechanism.java:149)
        at 
org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatchedStatements(ParameterizedSQLBatchWritingMechanism.java:134)
        at 
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.writesCompleted(DatabaseAccessor.java:1836)
        at 
org.eclipse.persistence.internal.sessions.AbstractSession.writesCompleted(AbstractSession.java:4244)
        at 
org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.writesCompleted(UnitOfWorkImpl.java:5594)
        at 
org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork.writeChanges(RepeatableWriteUnitOfWork.java:453)
        at 
org.eclipse.persistence.internal.jpa.EntityManagerImpl.flush(EntityManagerImpl.java:863)
        at 
org.eclipse.persistence.internal.jpa.QueryImpl.performPreQueryFlush(QueryImpl.java:963)
        at 
org.eclipse.persistence.internal.jpa.QueryImpl.executeReadQuery(QueryImpl.java:207)
        at 
org.eclipse.persistence.internal.jpa.QueryImpl.getSingleResult(QueryImpl.java:517)
        at 
org.eclipse.persistence.internal.jpa.EJBQueryImpl.getSingleResult(EJBQueryImpl.java:400)
        at org.apache.ambari.server.orm.dao.DaoUtils.selectOne(DaoUtils.java:80)
        at org.apache.ambari.server.orm.dao.StackDAO.find(StackDAO.java:93)
        at 
org.apache.ambari.server.orm.AmbariLocalSessionInterceptor.invoke(AmbariLocalSessionInterceptor.java:53)
        at 
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.setStackVersion(ServiceComponentHostImpl.java:1058)
        at 
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:628)
        at 
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$ServiceComponentHostOpStartedTransition.transition(ServiceComponentHostImpl.java:610)
        at 
org.apache.ambari.server.state.fsm.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:354)
        at 
org.apache.ambari.server.state.fsm.StateMachineFactory.doTransition(StateMachineFactory.java:294)
        at 
org.apache.ambari.server.state.fsm.StateMachineFactory.access$300(StateMachineFactory.java:39)
        at 
org.apache.ambari.server.state.fsm.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:440)
        at 
org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.handleEvent(ServiceComponentHostImpl.java:901)
        at 
org.apache.ambari.server.state.cluster.ClusterImpl.processServiceComponentHostEvents(ClusterImpl.java:2508)
        at 
org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:68)
        at 
org.apache.ambari.server.actionmanager.ActionScheduler.doWork(ActionScheduler.java:343)
        at 
org.apache.ambari.server.actionmanager.ActionScheduler.run(ActionScheduler.java:195)
        at java.lang.Thread.run(Thread.java:745)
{code}

- We have dual X-locks on {{hostcomponentstate}} asking for U-locks when 
updating the CLUSTERED INDEX.
- Both dual X-locks, from different transactions and different processes, are 
on the same row (technically impossible) - based on the XML execution plan, we 
can see that the concurrent UPDATE statements are executing on different rows 
due to their CLUSTERED INDEX predicate.
- In Java, Ambari has locks which prevent concurrent U- or X-locks on the same 
row
- Only happens on SQL Server

My best suspicion right now is that we have a key hash collision happening on 
this table. That's why two processes appear to have the same lock even though 
they are on different rows. 

I was able to use a database dump that I took to compare hash values from 
{{hostcomponentstate}}:

{code:sql}
SELECT  %%lockres%% as lock_hash, cluster_id, host_id, service_name, 
component_name
  FROM hostcomponentstate 
  ORDER BY host_id, service_name, %%lockres%%
{code}

{code}
lock_hash       cluster_id      host_id service_name    component_name
(0d4a8b0869f5)  2       1       HDFS    SECONDARY_NAMENODE
(0d4a8b0869f5)  2       1       HDFS    HDFS_CLIENT
(99fb0081b824)  2       1       MAPREDUCE2      HISTORYSERVER
(7086998db3dc)  2       1       YARN    APP_TIMELINE_SERVER
(7086998db3dc)  2       1       YARN    RESOURCEMANAGER
(3bef52323322)  2       1       ZOOKEEPER       ZOOKEEPER_SERVER
...
{code}

SQL Server is producing lock hashes that collide! It seems like the issue here 
is that we are using a CLUSTERED INDEX on 4 columns, 3 of which are always the 
same (cluster, host, service) in many cases. The only variable is the component 
name. When the hash gets truncated to 6 bytes, we get duplicates. 

So, I think this totally aligns with my suspicions as to why this is only a SQL 
Server problem since other database don't lock like this. It also makes sense 
that this is the only table this happens on since we are not using a surrogate 
PK here. I think we have a few options here:

- Add more columns into the CLUSTERED INDEX in hopes we get a more unique hash. 
The problem is that the other columns are also basically the same.
- Change the CLUSTERED INDEX to an UNCLUSTERED INDEX (since this is our main 
query criteria) and use a single, unique BIGINT PK as we do for many other 
tables. I'm just not sure how SQL Server locks on a row when there is a 
CLUSTERED INDEX which is not part of the predicate.
- Remove the CLUSTERED INDEX entirely (performance would probably tank)
- It's possible we can try to partition this table differently so that lock 
space is more unique.
- Change the existing CLUSTERED INDEX so that it disallows row and page level 
locks, forcing all X-locks to be table-level locks. This would, in theory, 
prevent the deadlock and would not require us to change any data. But it would 
introduce a bottleneck on the table for anything more than a single read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to