RE: Trafodion master Daily Test Result - 332 - Still Failing This test Executor/Test106 has been passing since several builds except in this build instance. Builds subsequent to this build also have this test passing consistently.
In the current failed test, something has triggered failure of closing an active scanner. The behavior we see (the hang situation) is the expected behavior when an active scanner is not closed or cleaned up. There are two Jiras that is being worked on that could potentially cause a situation like this to occur. TRAFODION-2149 and TRAFODION-2151. If this problem happens again even after fixes to the two Jiras, we may have to look into this issue much closely. Thanks, Prashanth *From:* Sean Broeder [mailto:sean.broe...@esgyn.com] *Sent:* Thursday, September 15, 2016 10:14 AM *To:* Narain Arvind <narain.arv...@gmail.com>; dev@trafodion.incubator.apache.org *Cc:* Prashanth Vasudev <prashanth.vasu...@esgyn.com> *Subject:* RE: Trafodion master Daily Test Result - 332 - Still Failing In this case we are trying to drop a table and we want to disable all the regions. Before doing so, each region makes sure it allows any pending transactions to complete before closing. One region finds an active transactional scanner and continues to wait for the scanner to clear. 2016-09-15 10:26:55,574 INFO [PriorityRpcServer.handler=11,queue=1,port=16020] transactional.SplitBalanceHelper: scannersListClear Active Scanner found, ScannerId: 0 Txid: 1054 Region: TRAFODION.SCH.T106A,\x00\x00\x00\x02\x00\x00\x00\x00,1473934644524.39d9ea92d6f2109a682f66f38a795b79. 2016-09-15 10:26:55,726 INFO [PriorityRpcServer.handler=8,queue=0,port=16020] transactional.SplitBalanceHelper: scannersListClear Active Scanner found, ScannerId: 0 Txid: 1054 Region: TRAFODION.SCH.T106A,\x00\x00\x00\x02\x00\x00\x00\x00,1473934644524.39d9ea92d6f2109a682f66f38a795b79. 2016-09-15 10:26:56,075 INFO [PriorityRpcServer.handler=11,queue=1,port=16020] transactional.SplitBalanceHelper: scannersListClear Active Scanner found, ScannerId: 0 Txid: 1054 Region: TRAFODION.SCH.T106A,\x00\x00\x00\x02\x00\x00\x00\x00,1473934644524.39d9ea92d6f2109a682f66f38a795b79. 2016-09-15 10:26:56,226 INFO [PriorityRpcServer.handler=8,queue=0,port=16020] transactional.SplitBalanceHelper: scannersListClear Active Scanner found, ScannerId: 0 Txid: 1054 Region: TRAFODION.SCH.T106A,\x00\x00\x00\x02\x00\x00\x00\x00,1473934644524.39d9ea92d6f2109a682f66f38a795b79. Since it doesn’t clear it essentially locks up the drop forever. I know there was a bug for a while where a transaction’s own scanner would block a drop, but a change was made to allow the close to go through if the scanner was for our own transaction. The logging doesn’t show the transaction of the drop (that I see). I think Prashanth should have a look at this. Regards, Sean *From:* Sean Broeder [mailto:sean.broe...@esgyn.com] *Sent:* Thursday, September 15, 2016 9:47 AM *To:* 'Narain Arvind' <narain.arv...@gmail.com>; ' dev@trafodion.incubator.apache.org' <dev@trafodion.incubator.apache.org> *Subject:* RE: Trafodion master Daily Test Result - 332 - Still Failing Sure. I’ll have a look *From:* Narain Arvind [mailto:narain.arv...@gmail.com <narain.arv...@gmail.com>] *Sent:* Thursday, September 15, 2016 9:22 AM *To:* dev@trafodion.incubator.apache.org *Cc:* Sean Broeder <sean.broe...@esgyn.com> *Subject:* RE: Trafodion master Daily Test Result - 332 - Still Failing Timeout in core-regress-executor-hdp is during drop of table T106A. https://jenkins.esgyn.com/job/core-regress-executor-hdp/369/console Messages related to transactional.SplitBalancerHelper repeated in Region Server logs. Sean could you please take a look ? Thanks. http://traf-testlogs.esgyn.com/Daily-master/332/regress-executor-hdp/hbase-logs/hbase-hbase-regionserver-slave-ahw23.log 2016-09-15 10:25:01,806 INFO [RS_CLOSE_REGION-slave-ahw23:16020-2] regionserver.HRegion: Closed TRAFODION.SCH.T106A,\x00\x00\x00\x01\x00\x00\x00\x00,1473934644524.f32b6c8d8f3932729b9824f51a95a63e. 2016-09-15 10:25:02,059 INFO [PriorityRpcServer.handler=8,queue=0,port=16020] transactional.SplitBalanceHelper: scannersListClear Active Scanner found, ScannerId: 0 Txid: 1054 Region: TRAFODION.SCH.T106A,\x00\x00\x00\x02\x00\x00\x00\x00,1473934644524.39d9ea92d6f2109a682f66f38a795b79. … 2016-09-15 10:31:37,204 INFO [regionserver/ slave-ahw23.trafodion.org/172.31.3.234:16020.leaseChecker] regionserver.RSRpcServices: Scanner 12971 lease expired on region TRAFODION.SCH.T106A,\x00\x00\x00\x01\x00\x00\x00\x00,1473934644524.f32b6c8d8f3932729b9824f51a95a63e. 2016-09-15 10:31:37,205 ERROR [regionserver/ slave-ahw23.trafodion.org/172.31.3.234:16020.leaseChecker] regionserver.RSRpcServices: Closing scanner for TRAFODION.SCH.T106A,\x00\x00\x00\x01\x00\x00\x00\x00,1473934644524.f32b6c8d8f3932729b9824f51a95a63e. org.apache.hadoop.hbase.NotServingRegionException: Region TRAFODION.SCH.T106A,\x00\x00\x00\x01\x00\x00\x00\x00,1473934644524.f32b6c8d8f3932729b9824f51a95a63e. is not online on slave-ahw23.trafodion.org,16020,1473931541282 at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2898) at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2875) at org.apache.hadoop.hbase.regionserver.RSRpcServices$ScannerListener.leaseExpired(RSRpcServices.java:285) at org.apache.hadoop.hbase.regionserver.Leases.run(Leases.java:121) at java.lang.Thread.run(Thread.java:745) … 2016-09-15 11:29:49,857 INFO [PriorityRpcServer.handler=1,queue=1,port=16020] transactional.SplitBalanceHelper: scannersListClear Active Scanner found, ScannerId: 0 Txid: 1054 Region: TRAFODION.SCH.T106A,\x00\x00\x00\x02\x00\x00\x00\x00,1473934644524.39d9ea92d6f2109a682f66f38a795b79. 2016-09-15 11:29:49,864 INFO [PriorityRpcServer.handler=19,queue=1,port=16020] transactional.SplitBalanceHelper: scannersListClear Active Scanner found, ScannerId: 0 Txid: 1054 Region: TRAFODION.SCH.T106A,\x00\x00\x00\x02\x00\x00\x00\x00,1473934644524.39d9ea92d6f2109a682f66f38a795b79. >From the master logs: http://traf-testlogs.esgyn.com/Daily-master/332/regress-executor-hdp/hbase-logs/hbase-hbase-master-slave-ahw23.log 2016-09-15 10:25:01,810 INFO [AM.ZK.Worker-pool2-t265] master.RegionStates: Transition {f32b6c8d8f3932729b9824f51a95a63e state=PENDING_CLOSE, ts=1473935101056, server=slave-ahw23.trafodion.org,16020,1473931541282} to {f32b6c8d8f3932729b9824f51a95a63e state=OFFLINE, ts=1473935101810, server=slave-ahw23.trafodion.org,16020,1473931541282} 2016-09-15 10:25:01,810 INFO [AM.ZK.Worker-pool2-t265] master.RegionStates: Offlined f32b6c8d8f3932729b9824f51a95a63e from slave-ahw23.trafodion.org,16020,1473931541282 2016-09-15 10:26:31,059 INFO [slave-ahw23.trafodion.org,16000,1473931529194-org.apache.hadoop.hbase.master.procedure.DisableTableProcedure$BulkDisabler-2] master.AssignmentManager: Server slave-ahw23.trafodion.org,16020,1473931541282 returned java.io.IOException: Call to slave-ahw23.trafodion.org/172.31.3.234:16020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=496, waitTime=90001, operationTimeout=90000 expired. for TRAFODION.SCH.T106A,\x00\x00\x00\x02\x00\x00\x00\x00,1473934644524.39d9ea92d6f2109a682f66f38a795b79., try=1 of 10 java.io.IOException: Call to slave-ahw23.trafodion.org/172.31.3.234:16020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=496, waitTime=90001, operationTimeout=90000 expired. at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1262) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1230) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287) at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.closeRegion(AdminProtos.java:23149) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.closeRegion(ProtobufUtil.java:1737) at org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:809) at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1853) at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:2571) at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:2583) at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:2458) at org.apache.hadoop.hbase.master.procedure.DisableTableProcedure$BulkDisabler$1.run(DisableTableProcedure.java:534) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=496, waitTime=90001, operationTimeout=90000 expired. at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1204) ... 13 more >From the dtm logs: http://traf-testlogs.esgyn.com/Daily-master/332/regress-executor-hdp/traf_run/logs/trafodion.dtm.log 2016-09-15 10:35:08,544 ERROR dtm.HBaseTxClient: Returning from HBaseTxClient:prepareCommit, txid: 1064 retval: RET_IOEXCEPTION IOException org.apache.hadoop.hbase.exceptions.TimeoutIOException: java.util.concurrent.TimeoutException: The procedure 672 is still running at org.apache.hadoop.hbase.client.HBaseAdmin.disableTable(HBaseAdmin.java:1205) at org.apache.hadoop.hbase.client.transactional.TransactionManager.disableTable(TransactionManager.java:2959) at org.apache.hadoop.hbase.client.transactional.TransactionManager.prepareCommit(TransactionManager.java:1893) at org.trafodion.dtm.HBaseTxClient.prepareCommit(HBaseTxClient.java:489) Caused by: java.util.concurrent.TimeoutException: The procedure 672 is still running at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:4177) at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:4098) at org.apache.hadoop.hbase.client.HBaseAdmin.disableTable(HBaseAdmin.java:1201) ... 3 more Regards Arvind -----Original Message----- From: steve.var...@esgyn.com [mailto:steve.var...@esgyn.com <steve.var...@esgyn.com>] Sent: Thursday, September 15, 2016 4:54 AM To: dev@trafodion.incubator.apache.org Subject: Trafodion master Daily Test Result - 332 - Still Failing Daily Automated Testing master Jenkins Job: https://jenkins.esgyn.com/job/Check-Daily-master/332/ Archived Logs: http://traf-testlogs.esgyn.com/Daily-master/332 Bld Downloads: http://traf-builds.esgyn.com Changes since previous daily build: [dbirdsall] [TRAFODION-2187] Fix DROP SCHEMA CASCADE when sample tables are present [anoop.sharma] jira TRAFODION-2184 groupby/orderby construct extensions and enablement [hzeller] [TRAFODION-2222] Trafodion on HDP 2.3 needs yarn client in class path [anoop.sharma] jira TRAFODION-2184 additional changes Test Job Results: FAILURE core-regress-executor-hdp (2 hr 20 min) SUCCESS build-rh6-master-debug (30 min) SUCCESS build-rh6-master-release (35 min) SUCCESS core-regress-charsets-cdh (44 min) SUCCESS core-regress-charsets-hdp (43 min) SUCCESS core-regress-compGeneral-cdh (41 min) SUCCESS core-regress-compGeneral-hdp (1 hr 1 min) SUCCESS core-regress-core-cdh (1 hr 1 min) SUCCESS core-regress-core-hdp (1 hr 25 min) SUCCESS core-regress-executor-cdh (1 hr 14 min) SUCCESS core-regress-fullstack2-cdh (10 min) SUCCESS core-regress-fullstack2-hdp (21 min) SUCCESS core-regress-hive-cdh (49 min) SUCCESS core-regress-hive-hdp (48 min) SUCCESS core-regress-privs1-cdh (50 min) SUCCESS core-regress-privs1-hdp (53 min) SUCCESS core-regress-privs2-cdh (51 min) SUCCESS core-regress-privs2-hdp (1 hr 20 min) SUCCESS core-regress-qat-cdh (29 min) SUCCESS core-regress-qat-hdp (38 min) SUCCESS core-regress-seabase-cdh (1 hr 27 min) SUCCESS core-regress-seabase-hdp (1 hr 54 min) SUCCESS core-regress-udr-cdh (25 min) SUCCESS core-regress-udr-hdp (40 min) SUCCESS jdbc_test-cdh (30 min) SUCCESS jdbc_test-hdp (48 min) SUCCESS phoenix_part1_T2-cdh (1 hr 7 min) SUCCESS phoenix_part1_T2-hdp (1 hr 19 min) SUCCESS phoenix_part1_T4-cdh (1 hr 2 min) SUCCESS phoenix_part1_T4-hdp (1 hr 23 min) SUCCESS phoenix_part2_T2-cdh (1 hr 2 min) SUCCESS phoenix_part2_T2-hdp (1 hr 19 min) SUCCESS phoenix_part2_T4-cdh (1 hr 0 min) SUCCESS phoenix_part2_T4-hdp (1 hr 19 min) SUCCESS pyodbc_test-cdh (11 min) SUCCESS pyodbc_test-hdp (27 min)