[ https://issues.apache.org/jira/browse/TRAFODION-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liu ming reassigned TRAFODION-2070:
-----------------------------------

    Assignee: liu ming

> Trafodion cannot adjust its working status in time when the network is broken
> -----------------------------------------------------------------------------
>
>                 Key: TRAFODION-2070
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2070
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: dtm
>    Affects Versions: 2.0-incubating, 2.1-incubating
>            Reporter: Jarek
>            Assignee: liu ming
>
> Issue Title: Trafodion cannot adjust its working status in time when the network is broken.
>
> Test Steps (including Part 1 and Part 2):
>
> Precondition: the test environment is healthy, including HDFS, HBase and EsgynDB.
>
> Part 1: The network is broken for a long time, limited here to 15 minutes.
>
> Step 0. Log in to a trafci session and run a SQL statement 'STMT_A' (insert/delete/update/select); the statement has been running for several minutes.
> Step 1. Use the command 'iptables -I INPUT -s $NODE_HOST -j DROP' to make the nap104 node's network unreachable for 15 minutes.
> Step 2. Check Step 0, the major SQL commands, and the HDFS/HBase running status.
> Here, the checks for Step 0 and the major SQL commands should be done on the nap101, nap102 and nap103 nodes.
>
> T-1: Step 0
>   Expect:
>   1. When trafci is connected to the nap104 node, the SQL statement 'STMT_A' fails and trafci exits normally.
>   2. When trafci is connected to the nap101/nap102/nap103 node, the SQL statement 'STMT_A' succeeds and trafci exits normally.
>   Actual:
>   For expect 1:
>   ISSUE 1: Checking the QID status of the SQL statement 'STMT_A' displays:
>     SQL> get statistics for qid MXID11003025894212331445711895145000000000206U3333300_339_SQL_CUR_7;
>     *** ERROR[2024] Server Process $ZSM003 is not running or could not be created. Operating System Error 14 was returned. [2016-05-31 09:22:48]
>   Back in the trafci session, 'STMT_A' does not return in time but hangs for a long time.
>   ISSUE 2: At the same time, open a new trafci session connected to another node (for example nap102) and run a SQL query statement 'STMT_B'. Checking the QID status of 'STMT_B' with './offender -s active' prints the following error and reports 0 rows selected:
>     *** ERROR[8921] The request to obtain runtime statistics for ACTIVE_QUERIES=30 timed out. Timeout period specified is 4 seconds.
>     --- 0 row(s) selected.
>   ISSUE 3: Back in the trafci sessions of 'STMT_A' and 'STMT_B', the sessions are interrupted with the errors below:
>     SQL> select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
>     *** ERROR[29157] There was a problem reading from the server
>     *** ERROR[29160] The message header was not long enough
>     SQL> insert into josh_test_after values (1);
>     *** ERROR[29443] Database connection does not exist. Please connect to the database by using the connect or the reconnect command.
>   For expect 2:
>   ISSUE 1: Checking the QID status of the SQL statement 'STMT_A' displays:
>     SQL> get statistics for qid MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5;
>     Qid                    MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5
>     Compile Start Time     2016/05/31 10:02:56.968052
>     Compile End Time       2016/05/31 10:02:59.207231
>     Compile Elapsed Time   0:00:02.239179
>     Execute Start Time     2016/05/31 10:02:59.207468
>     Execute End Time       -1
>     Execute Elapsed Time   0:05:46.731948
>     State                  CLOSE
>   Back in the trafci session, 'STMT_A' does not return in time but hangs for a long time.
>   ISSUE 2: At the same time, open a new trafci session connected to another node (for example nap102) and run a SQL query statement 'STMT_B'. Checking the QID status of 'STMT_B' with './offender -s active' prints the following error and reports 0 rows selected:
>     *** ERROR[8921] The request to obtain runtime statistics for ACTIVE_QUERIES=30 timed out. Timeout period specified is 4 seconds.
>     --- 0 row(s) selected.
>   BTW, all trafci sessions of 'STMT_A' and 'STMT_B' closed normally.
>
> T-2: Command: sqcheck
>            DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
>   Expect   1         2         0                1                4             Return in 1 minute
>   Actual   1         2         0                1                4             Return in 1 minute
>
> T-3: Command: dcscheck
>            DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
>   Expect   0                1                4             Return in 1 minute
>   Actual   0                1                4             Return in about 5 minutes
>
> T-4: Command: shell -c node info
>   Expect: Only the nap104 node down; return in 30 seconds.
>   Actual: Only the nap104 node down; return in 10 seconds.
>
> T-5: Command: cstat
>   Expect: Return in 30 seconds.
>   Actual: Return in about 3 minutes.
>
> T-6: Command: trafci
>   Expect:
>   1. Return in 1 minute whether login succeeds or fails.
>   2. A new SQL statement 'STMT_B' succeeds in trafci, and trafci exits normally.
>   Actual:
>   Case 1: When 'STMT_A' is connected to nap104 and a new trafci session is opened to execute 'STMT_B':
>   1. Login succeeds; return in 1 minute whether login succeeds or fails.
>   2. The new SQL statement 'STMT_B' fails, and trafci exits abnormally.
>   Case 2: When 'STMT_A' is connected to another node (for example nap102) and a new trafci session is opened to execute 'STMT_B':
>   1. Login sometimes succeeds with a variable return time; sometimes it fails by hanging for a long time, with no message printed (e.g. no timeout tip).
>   2. The new SQL statement 'STMT_B' succeeds, and trafci exits normally.
>
> T-7: HDFS
>   Expect: Only the nap104 data node down, the other 3 data nodes up, 1 name node up; the Data Node Health Summary process reports a minor alert.
>   Actual: Only the nap104 data node down, the other 3 data nodes up, 1 name node up; the Data Node Health Summary process reports a minor alert.
>
> T-8: HBase
>   Expect: 1 region server down, the other 3 region servers up, 1 HBase master up; the RegionServer Health Summary process reports a minor alert.
>   Actual: 1 region server down, the other 3 region servers up, 1 HBase master up; the RegionServer Health Summary process reports a minor alert.
>
> Step 3. After 15 minutes, make the nap104 node's network reachable again using the command 'iptables -D INPUT -s $NODE_HOST -j DROP'.
> Step 4. Check Step 0, the major SQL commands, and the HDFS/HBase running status.
>
> T-11: Command: sqcheck
>                         DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
>   Expect on any node    0         0         0                0                0             Return in 1 minute
>   Actual on nap101      1         2         0                1                4             Return in 1 minute
>   Actual on nap104      3         6         0                1                16            Return in 1 minute
>
> T-12: Command: dcscheck
>                         DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
>   Expect on any node    0                0                0             Return in 1 minute
>   Actual on nap101      0                1                4             Return in 1 minute
>   Actual on nap104      0                1                4             Return in 1 minute
>
> T-13: Command: shell -c node info
>   Expect on any node: 4 nodes up; return in 30 seconds.
>   Actual on nap101: Only nap104 down; nap101, nap102 and nap103 up; return in 30 seconds.
>   Actual on nap104: nap101, nap102 and nap103 down; only nap104 up; return in 30 seconds.
>
> T-14: Command: cstat
>   Expect on any node: Return in 30 seconds.
>   Actual on any node: Return in 30 seconds.
>
> T-15: Command: trafci
>   Expect on any node:
>   1. Return in 1 minute whether login succeeds or fails.
>   2. A new SQL statement succeeds in trafci, and trafci exits normally.
>   Actual on any node:
>   1. Login succeeds; return in 1 minute whether login succeeds or fails.
>   2. A new SQL statement succeeds in trafci, and trafci exits normally.
>
> T-16: HDFS
>   Expect: 4 data nodes up, 1 name node up, no alerts.
>   Actual: 4 data nodes up, 1 name node up, no alerts.
>
> T-17: HBase
>   Expect: 4 region servers up, 1 HBase master up, no alerts.
>   Actual: The region server process on the nap104 node is down (CRITICAL MESSAGE: Connection failed: [Errno 111] Connection refused to nap104.esgyn.local:60030); at the same time, the nap101 node (HBase master node) reports the critical message "Dead RegionServer(s): 1 out of 3" via the RegionServer Health Summary process. 1 HBase master up.
>
> Part 2: The network is unstable: up for 1 minute, then down for the next minute, again and again.
>
> Step 0. Log in to a trafci session and run a SQL statement 'STMT_A' (insert/delete/update/select); the statement has been running for several minutes.
> Step 1. Make the nap104 node's network unstable, then check Step 0, the major SQL commands, and the HDFS/HBase running status.
> Here, the checks for Step 0 and the major SQL commands should be done on the nap101, nap102 and nap103 nodes.
>
> T-21: Step 0
>   Expect:
>   1. When trafci is connected to the nap104 node, the SQL statement 'STMT_A' fails and trafci exits normally.
>   2. When trafci is connected to the nap101/nap102/nap103 node, the SQL statement 'STMT_A' succeeds and trafci exits normally.
>   Actual:
>   For expect 1:
>   ISSUE 1: Executing 'STMT_A' gives the following errors in the trafci session:
>     SQL> select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
>     *** ERROR[29157] There was a problem reading from the server
>     *** ERROR[29160] The message header was not long enough
>     SQL> insert into josh_test values (1);
>     *** ERROR[29443] Database connection does not exist. Please connect to the database by using the connect or the reconnect command.
>   ISSUE 2 (intermittent): Opening a new trafci session to run a new SQL statement 'STMT_B' gives the following errors:
>     *** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause:
>     java.util.concurrent.ExecutionException: java.io.IOException: performScan encountered Exception txID: 8591654594 Exception: org.apache.hadoop.hbase.UnknownScannerException: TrxRegionEndpoint getScanner - scanner id 0, already closed?
>     java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     java.util.concurrent.FutureTask.get(FutureTask.java:188)
>     org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
>     . [2016-05-31 12:13:05]
>   For expect 2:
>   ISSUE 1: Executing 'STMT_A' gives the following errors in the trafci session:
>     SQL> select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
>     *** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause:
>     java.util.concurrent.ExecutionException: java.io.IOException: performScan encountered Exception txID: 4296737298 Exception: org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: TrxRegionEndpoint coprocessor: getScanner - scanner id 0, Expected nextCallSeq: 50, But the nextCallSeq received from client: 49
>     java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     java.util.concurrent.FutureTask.get(FutureTask.java:188)
>     org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
>     . [2016-05-31 11:30:36]
>
> T-22: Command: sqcheck
>                                         DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
>   Network broken (1 min)     Expect     1         0         0                1                4             Return in 1 minute
>                              Actual     0         0         0                1                4             Return in 1 minute
>   Network recovered (1 min)  Expect     0         0         0                0                0             Return in 2 minutes
>                              Actual     0         0         0                0                0             Return in 2 minutes
>
> T-23: Command: dcscheck
>                                         DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
>   Network broken (1 min)     Expect     0                1                4             Return in 2 minutes
>                              Actual     0                1                4             Return in 2 minutes
>   Network recovered (1 min)  Expect     0                1                4             Return in 2 minutes
>                              Actual     0                1                4             Return in 2 minutes
>
> T-24: Command: shell -c node info
>   Network broken (1 min):
>     Expect: Only the nap104 node down, other nodes up; return in 30 seconds.
>     Actual: 4 nodes up; return in 10 seconds.
>   Network recovered (1 min):
>     Expect: 4 nodes up; return in 10 seconds.
>     Actual: 4 nodes up; return in 10 seconds.
>
> T-25: Command: cstat
>   Network broken (1 min):
>     Expect: Return in 30 seconds.
>     Actual: Return in 30 seconds.
>   Network recovered (1 min):
>     Expect: Return in 30 seconds.
>     Actual: Return in 30 seconds.
>
> T-26: Command: trafci
>   Network broken (1 min):
>     Expect: Login succeeds; return in 1 minute whether login succeeds or fails.
>     Actual: Login succeeds; return in 1 minute whether login succeeds or fails.
>   Network recovered (1 min):
>     Expect: Login succeeds; return in 1 minute whether login succeeds or fails.
>     Actual: Login succeeds; return in 1 minute whether login succeeds or fails.
>
> T-27: HDFS
>   Network broken (1 min):
>     Expect: Only the nap104 data node down, other data nodes up, 1 name node up.
>     Actual: Only the nap104 data node down, other data nodes up, 1 name node up.
>   Network recovered (1 min):
>     Expect: 4 data nodes up, 1 name node up.
>     Actual: 4 data nodes up, 1 name node up.
>
> T-28: HBase
>   Network broken (1 min):
>     Expect: 1 region server down, the other 3 region servers up, 1 HBase master up; the RegionServer Health Summary process reports a minor alert.
>     Actual: 1 region server down, the other 3 region servers up, 1 HBase master up; the RegionServer Health Summary process reports a minor alert.
>   Network recovered (1 min):
>     Expect: 4 region servers up, 1 HBase master up, no alerts.
>     Actual: The region server process on the nap104 node is down (CRITICAL MESSAGE: Connection failed: [Errno 111] Connection refused to nap104.esgyn.local:60030); at the same time, the nap101 node (HBase master node) reports the critical message "Dead RegionServer(s): 1 out of 3" via the RegionServer Health Summary process. 1 HBase master up.
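The fault injection in both parts reduces to inserting and later deleting an iptables DROP rule on nap104. A minimal driver sketch of that procedure is below, assuming it runs as root on nap104 and that NODE_HOST holds a peer node's address; the DRY_RUN flag and the helper function names are hypothetical conveniences for previewing the commands, not part of the original test scripts.

```shell
#!/bin/sh
# Network-fault injection sketch for Parts 1 and 2 of the test.
# DRY_RUN=1 (the default here) only echoes the commands instead of
# executing them, so the sequence can be inspected without root.
NODE_HOST="${NODE_HOST:-10.0.0.101}"   # peer address to drop (assumption)
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

# Step 1 of Part 1: make traffic from $NODE_HOST unreachable.
break_network()   { run iptables -I INPUT -s "$NODE_HOST" -j DROP; }
# Step 3 of Part 1: remove the rule to restore reachability.
restore_network() { run iptables -D INPUT -s "$NODE_HOST" -j DROP; }

# Part 1: one long outage, limited to 15 minutes (900 seconds).
part1() {
  break_network
  run sleep 900
  restore_network
}

# Part 2: flapping network, down 1 minute / up 1 minute, repeated.
part2() {
  for i in 1 2 3; do
    break_network;   run sleep 60
    restore_network; run sleep 60
  done
}

part1   # preview Part 1's command sequence (dry run by default)
```

Between break_network and restore_network, the checks above (sqcheck, dcscheck, shell -c node info, cstat, trafci) would be run from the surviving nodes nap101–nap103.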
-- This message was sent by Atlassian JIRA (v6.3.4#6332)