[ 
https://issues.apache.org/jira/browse/TRAFODION-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Subbiah updated TRAFODION-924:
-------------------------------------
    Fix Version/s:     (was: 2.2.0)

> LP Bug: 1413241 - ENDTRANSACTION hang, transaction state FORGETTING
> -------------------------------------------------------------------
>
>                 Key: TRAFODION-924
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-924
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: dtm
>            Reporter: Apache Trafodion
>            Assignee: Atanu Mishra
>            Priority: Critical
>
> A loop to reexecute the seabase developer regression suite hung on the 14th 
> iteration in TEST016. The sqlci console looked like this:
> >>-- char type
> >>create table mcStatPart1
> +>(a int not null not droppable,
> +>b char(10) not null not droppable,
> +>f int, txt char(100),
> +>primary key (a,b))
> +>salt using 8 partitions ;
> --- SQL operation complete.
> >>
> >>insert into mcStatPart1 values 
> >>(1,'123',1,'xyz'),(1,'133',1,'xyz'),(1,'423',1,'xyz'),(2,'111',1,'xyz'),(2,'223',1,'xyz'),(2,'323',1,'xyz'),(2,'423',1,'xyz'),
> +>(3,'123',1,'xyz'),(3,'133',1,'xyz'),(3,'423',1,'xyz'),(4,'111',1,'xyz'),(4,'223',1,'xyz'),(4,'323',1,'xyz'),(4,'423',1,'xyz');
> A pstack of the sqlci (0,13231) showed it blocking in a call to 
> ENDTRANSACTION, and dtmci showed this for the transaction:
> DTMCI > list
> Transid         Owner eventQ  pending Joiners TSEs    State
> (0,13742)       0,13231       0       0       0       0       FORGETTING
> Here's a copy of Sean's analysis:
> From: Broeder, Sean 
> Sent: Wednesday, January 21, 2015 8:43 AM
> To: Hanlon, Mike; Cooper, Joanie
> Cc: DeRoo, John
> Subject: RE: ENDTRANSACTION hang, transaction state FORGETTING
> Hi Mike,
> It looks like we have a ZooKeeper problem right at the time of the commit.  A 
> table is offline:
> 2015-01-21 11:13:45,529 WARN zookeeper.ZKUtil: 
> hconnection-0x1646b7c-0x14aefd0ac4a5e18, quorum=localhost:47570, 
> baseZNode=/hbase Unable to get data of znode 
> /hbase/table/TRAFODION.HBASE.MCSTATPART1
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /hbase/table/TRAFODION.HBASE.MCSTATPART1
> Then we fail after 3 retries of sending the commit request:
> 2015-01-21 11:14:04,405 ERROR transactional.TransactionManager: doCommitX, 
> result size: 0
> 2015-01-21 11:14:04,405 ERROR transactional.TransactionManager: doCommitX, 
> result size: 0
> Normally we would create a recovery entry for this transaction to redrive 
> commit, but it appears we are unable to do that due to the ZooKeeper errors:
> 2015-01-21 11:14:04,408 DEBUG 
> client.HConnectionManager$HConnectionImplementation: Removed all cached 
> region locations that map to g4t3005.houston.hp.com,42243,1421362639257
> 471340 2015-01-21 11:14:05,255 WARN zookeeper.RecoverableZooKeeper: Possibly 
> transient ZooKeeper, quorum=localhost:47570, 
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /hbase/table/TRAFODION.HBASE.MCSTATPART1
> 471341 2015-01-21 11:14:05,256 WARN zookeeper.RecoverableZooKeeper: Possibly 
> transient ZooKeeper, quorum=localhost:47570, 
> exception=org.apache.zookeeper.KeeperException$ConnectionLossException: 
> KeeperErrorCode = ConnectionLoss for 
> /hbase/table/TRAFODION.HBASE.MCSTATPART1
> 471342 2015-01-21 11:14:05,256 INFO util.RetryCounter: Sleeping 1000ms before 
> retry #0...
> 471343 2015-01-21 11:14:05,256 INFO util.RetryCounter: Sleeping 1000ms before 
> retry #0...
> HBase looks like it’s having trouble, as I can’t even do a list operation 
> from the hbase shell:
> 2015-01-21 14:40:28,816 ERROR [main] 
> client.HConnectionManager$HConnectionImplementation: Can't get connection to 
> ZooKeeper: KeeperErrorCode = ConnectionLoss for /hbase
> We need to think about how to handle this better in the TransactionManager, but 
> in reality I’m not sure what we can do if ZooKeeper fails.  You can open an 
> LP bug so we have a record of it and can discuss what to do.
> Thanks,
> Sean
> _____________________________________________
> From: Hanlon, Mike 
> Sent: Wednesday, January 21, 2015 6:17 AM
> To: Cooper, Joanie
> Cc: Broeder, Sean; DeRoo, John
> Subject: ENDTRANSACTION hang, transaction state FORGETTING
> Hi Joanie,
> Have we seen this before? A SQL regression test (in this case 
> seabase/TEST016) hangs in a call to ENDTRANSACTION. The transaction state is 
> shown in dtmci to be FORGETTING.  It probably is not easy to reproduce, since 
> the problem occurred on the 14th iteration of a loop to re-execute the 
> seabase suite. 
> There are a lot of messages in 
> /opt/home/mhanlon/trafodion/core/sqf/logs/trafodion.dtm.log on my 
> workstation, sqws112. The transid in question is 13742. Would somebody like 
> to look while things are still hung, before I try to force a cleanup?
> thanks
> Mike



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
