Jarek created TRAFODION-2070:
--------------------------------
Summary: Trafodion cannot adjust your working status in time when
network broken.
Key: TRAFODION-2070
URL: https://issues.apache.org/jira/browse/TRAFODION-2070
Project: Apache Trafodion
Issue Type: Bug
Components: dtm
Affects Versions: 2.0-incubating, 2.1-incubating
Reporter: Jarek
Issue Title: Trafodion cannot adjust your working status in time when network
broken.
Test Steps (including part 1 and part 2):
Precondition: the test environment is healthy, including HDFS, HBase and EsgynDB.
Part 1: The network is broken for a long time, limited here to 15 minutes.
Step 0. Log in to a TRAFCI session and run a long SQL statement ‘STMT_A’ (an
insert/delete/update/select); the statement has been running for several
minutes at this point.
Step 1. Use the command ‘iptables -I INPUT -s $NODE_HOST -j DROP’ to make the
nap104 node unreachable over the network for 15 minutes.
Step 2. Check the Step 0 statement, the major SQL commands, and the HDFS/HBase
running status.
These checks are run on the nap101, nap102 and nap103 nodes.
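The outage in Step 1 can be scripted so that the block and the later unblock always use matching rules. The following is a minimal sketch, not the procedure actually used for this report: the helper names (`block_cmd`, `unblock_cmd`, `outage_plan`) are hypothetical, the report does not say on which node the commands are run or what `$NODE_HOST` holds, and the sketch only builds and prints the commands instead of executing them.

```python
from typing import List, Tuple


def block_cmd(host: str) -> List[str]:
    """iptables rule from Step 1: drop incoming packets whose source is `host`."""
    return ["iptables", "-I", "INPUT", "-s", host, "-j", "DROP"]


def unblock_cmd(host: str) -> List[str]:
    """Matching delete (-D) of the same rule, which lets the traffic through again."""
    return ["iptables", "-D", "INPUT", "-s", host, "-j", "DROP"]


def outage_plan(host: str, outage_seconds: int) -> List[Tuple[int, List[str]]]:
    """Return (offset_seconds, command) pairs: block at 0, unblock at the end."""
    if outage_seconds <= 0:
        raise ValueError("outage_seconds must be positive")
    return [(0, block_cmd(host)), (outage_seconds, unblock_cmd(host))]


if __name__ == "__main__":
    # "NODE_HOST_VALUE" is a placeholder for $NODE_HOST; 15 minutes = 900 seconds.
    for offset, cmd in outage_plan("NODE_HOST_VALUE", 15 * 60):
        print(f"t+{offset:>4}s  {' '.join(cmd)}")
```

Keeping both commands in one place makes it harder to leave a stale DROP rule behind after the test.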
T-1
Step 0 Comments
Expect 1. When TRAFCI is connected to the nap104 node, the SQL statement
‘STMT_A’ fails and TRAFCI exits normally.
2. When TRAFCI is connected to the nap101/nap102/nap103 node, the SQL
statement ‘STMT_A’ succeeds and TRAFCI exits normally.
Actual For expect 1:
ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ displays
“SQL>get statistics for qid
MXID11003025894212331445711895145000000000206U3333300_339_SQL_CUR_7;
*** ERROR[2024] Server Process $ZSM003 is not running or could not be created.
Operating System Error 14 was returned. [2016-05-31 09:22:48]”.
Back in the TRAFCI session, the SQL statement ‘STMT_A’ does not return in time
and hangs for a long time.
ISSUE 2: At the same time, open a new TRAFCI session connected to another node,
for example nap102, and run a SQL query ‘STMT_B’. Using the command
‘./offender -s active’ to check the QID status of ‘STMT_B’ prints the
following error and reports 0 row(s) selected:
“*** ERROR[8921] The request to obtain runtime statistics for ACTIVE_QUERIES=30
timed out. Timeout period specified is 4 seconds.
--- 0 row(s) selected.”
ISSUE 3: Back in the TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’, both sessions
are interrupted by the errors below.
“SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
*** ERROR[29157] There was a problem reading from the server
*** ERROR[29160] The message header was not long enough
SQL>insert into josh_test_after values (1);
*** ERROR[29443] Database connection does not exist. Please connect to the
database by using the connect or the reconnect command.”
For expect 2:
ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ displays
“SQL>get statistics for qid
MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5;+>
Qid                   MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5
Compile Start Time    2016/05/31 10:02:56.968052
Compile End Time      2016/05/31 10:02:59.207231
Compile Elapsed Time  0:00:02.239179
Execute Start Time    2016/05/31 10:02:59.207468
Execute End Time      -1
Execute Elapsed Time  0:05:46.731948
State                 CLOSE”.
Back in the TRAFCI session, the SQL statement ‘STMT_A’ does not return in time
and hangs for a long time.
ISSUE 2: At the same time, open a new TRAFCI session connected to another node,
for example nap102, and run a SQL query ‘STMT_B’. Using the command
‘./offender -s active’ to check the QID status of ‘STMT_B’ prints the
following error and reports 0 row(s) selected:
“*** ERROR[8921] The request to obtain runtime statistics for ACTIVE_QUERIES=30
timed out. Timeout period specified is 4 seconds.
--- 0 row(s) selected.
>>”
BTW, all TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’ are closed normally.
T-2
Command: sqcheck
         DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Expect   1         2         0                1                4             Return in 1 minute
Actual   1         2         0                1                4             Return in 1 minute
T-3
Command: dcscheck
         DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Expect   0                1                4             Return in 1 minute
Actual   0                1                4             Returned in about 5 minutes
T-4
Command: shell -c node info Comments
Expect Only nap104 node down, return in 30 seconds
Actual Only nap104 node down, return in 10 seconds
T-5
Command: cstat Comments
Expect Return in 30 seconds.
Actual Returned in about 3 minutes.
T-6
Commands: trafci Comments
Expect 1. Login succeeds; the login attempt returns within 1 minute whether it
succeeds or fails.
2. The new SQL statement ‘STMT_B’ runs successfully in TRAFCI, and TRAFCI exits
normally.
Actual Case 1: When ‘STMT_A’ is connected to nap104, open a new TRAFCI session
to execute ‘STMT_B’:
1. Login succeeds; the login attempt returns within 1 minute whether it
succeeds or fails.
2. The new SQL statement ‘STMT_B’ fails, and TRAFCI exits abnormally.
Case 2: When ‘STMT_A’ is connected to another node, for example nap102,
open a new TRAFCI session to execute ‘STMT_B’:
1. Login sometimes succeeds, with a variable return time, and sometimes fails
after hanging for a long time with no message printed (for example, no timeout
hint).
2. The new SQL statement ‘STMT_B’ runs successfully, and TRAFCI exits normally.
T-7
HDFS Comments
Expect Only nap104 data node down, other 3 data nodes up, 1 name node up, Data
Node Health Summary process reports a minor alert.
Actual Only nap104 data node down, other 3 data nodes up, 1 name node up, Data
Node Health Summary process reports a minor alert.
T-8
HBase Comments
Expect 1 region server down, other 3 region servers up, 1 HBASE master up,
RegionServer Health Summary process reports a minor alert.
Actual 1 region server down, other 3 region servers up, 1 HBASE master up,
RegionServer Health Summary process reports a minor alert.
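Several of the checks above compare how long a command takes to return against a limit (for example, cstat is expected to return within 30 seconds). A minimal sketch of how such a check could be timed consistently is shown below. The names `timed_run` and `TimedResult` and the `hard_timeout` default are assumptions made for illustration, not part of Trafodion or of the procedure used for this report.

```python
import subprocess
import sys
import time
from typing import NamedTuple, Optional, Sequence


class TimedResult(NamedTuple):
    elapsed_seconds: float
    within_limit: bool
    returncode: Optional[int]  # None if the command was killed at hard_timeout


def timed_run(cmd: Sequence[str], limit_seconds: float,
              hard_timeout: float = 600.0) -> TimedResult:
    """Run `cmd`, measure wall-clock time, and compare it with `limit_seconds`.

    `hard_timeout` only keeps a hung command from blocking the check forever.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL, timeout=hard_timeout)
        returncode: Optional[int] = proc.returncode
    except subprocess.TimeoutExpired:
        returncode = None
    elapsed = time.monotonic() - start
    within = returncode is not None and elapsed <= limit_seconds
    return TimedResult(elapsed, within, returncode)


if __name__ == "__main__":
    # Harmless stand-in command; on a cluster node this could be ["cstat"].
    print(timed_run([sys.executable, "-c", "pass"], limit_seconds=30))
```

Recording the elapsed time together with the limit makes entries such as "return about 3 minutes" reproducible as numbers.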
Step 3. After 15 minutes, make the nap104 node reachable again using the command
‘iptables -D INPUT -s $NODE_HOST -j DROP’.
Step 4. Check the Step 0 statement, the major SQL commands, and the HDFS/HBase
running status.
T-11
Command: sqcheck
                       DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Expect on any node     0         0         0                0                0             Return in 1 minute
Actual on nap101 node  1         2         0                1                4             Return in 1 minute
Actual on nap104 node  3         6         0                1                16            Return in 1 minute
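To make the gap between expected and actual sqcheck counts in T-11 explicit, the rows can be compared column by column. This is only an illustrative sketch: the counts are typed in by hand from the T-11 rows, the helper name `mismatches` is hypothetical, and nothing here parses real sqcheck output.

```python
from typing import Dict, Tuple

COLUMNS = ("DTM Down", "RMS Down", "DCS Master Down",
           "DCS Server Down", "MxoSrvr Down")


def mismatches(expected: Dict[str, int],
               actual: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {column: (expected, actual)} for every column whose counts differ."""
    return {c: (expected[c], actual[c])
            for c in COLUMNS if expected[c] != actual[c]}


if __name__ == "__main__":
    expect = dict(zip(COLUMNS, (0, 0, 0, 0, 0)))          # T-11, expected row
    actual_nap101 = dict(zip(COLUMNS, (1, 2, 0, 1, 4)))   # T-11, nap101 row
    actual_nap104 = dict(zip(COLUMNS, (3, 6, 0, 1, 16)))  # T-11, nap104 row
    for node, actual in (("nap101", actual_nap101), ("nap104", actual_nap104)):
        print(node, mismatches(expect, actual))
```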
T-12
Command: dcscheck
                       DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Expect on any node     0                0                0             Return in 1 minute
Actual on nap101 node  0                1                4             Return in 1 minute
Actual on nap104 node  0                1                4             Return in 1 minute
T-13
Command: shell -c node info Comments
Expect on any node 4 nodes up, return in 30 seconds
Actual on nap101 node Only nap104 down; nap101, nap102 and nap103 up; return in
30 seconds.
Actual on nap104 node nap101, nap102 and nap103 down; only nap104 up; return in
30 seconds.
T-14
Command: cstat Comments
Expect on any node return in 30 seconds
Actual on any node return in 30 seconds
T-15
Command: trafci Comments
Expect on any node 1. Login succeeds; the login attempt returns within 1 minute
whether it succeeds or fails.
2. A new SQL statement runs successfully in TRAFCI, and TRAFCI exits normally.
Actual on any node 1. Login succeeds; the login attempt returns within 1 minute
whether it succeeds or fails.
2. A new SQL statement runs successfully in TRAFCI, and TRAFCI exits normally.
T-16
HDFS Comments
Expect 4 data nodes up, 1 name node up, no alerts.
Actual 4 data nodes up, 1 name node up, no alerts.
T-17
HBase Comments
Expect 4 region servers up, 1 HBASE master up, no alerts.
Actual The region server process on the nap104 node is down (CRITICAL MESSAGE:
Connection failed: [Errno 111] Connection refused to nap104.esgyn.local:60030).
At the same time, the nap101 node (HBASE master node) reports the critical
message “Dead RegionServer(s): 1 out of 3” through the RegionServer Health
Summary process. 1 HBASE master up.
Part 2: The network is unstable: up for 1 minute, then down for 1 minute,
repeatedly.
Step 0. Log in to a TRAFCI session and run a long SQL statement ‘STMT_A’ (an
insert/delete/update/select); the statement has been running for several
minutes at this point.
Step 1. Make the nap104 node’s network unstable, then check the Step 0
statement, the major SQL commands, and the HDFS and HBase running status.
These checks are run on the nap101, nap102 and nap103 nodes.
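The up/down pattern of Part 2 can be written down as an explicit schedule, which helps line up each check with the network state at that moment. The sketch below is an assumption about how the flapping could be timed. The report does not say how the instability was produced, `flap_schedule` is a hypothetical helper, and the schedule assumes the network starts in the up state for one full period.

```python
from typing import Iterator, Tuple


def flap_schedule(period_seconds: int, cycles: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset_seconds, new_state) transitions for a flapping network.

    The network is assumed to be up at offset 0. After one period it goes
    "down", after another period it comes back "up", and so on for `cycles`
    down/up pairs.
    """
    if period_seconds <= 0 or cycles <= 0:
        raise ValueError("period_seconds and cycles must be positive")
    offset = period_seconds
    for _ in range(cycles):
        yield offset, "down"
        offset += period_seconds
        yield offset, "up"
        offset += period_seconds


if __name__ == "__main__":
    # 1 minute up, 1 minute down, repeated three times.
    for offset, state in flap_schedule(60, 3):
        print(f"t+{offset:>3}s  network {state}")
```

Each "down" transition would correspond to the ‘iptables -I’ command and each "up" transition to the ‘iptables -D’ command from Part 1, if the same mechanism is used.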
T-21
Step 0 Comments
Expect 1. When TRAFCI is connected to the nap104 node, the SQL statement
‘STMT_A’ fails and TRAFCI exits normally.
2. When TRAFCI is connected to the nap101/nap102/nap103 node, the SQL
statement ‘STMT_A’ succeeds and TRAFCI exits normally.
Actual For expect 1:
ISSUE 1: Executing ‘STMT_A’ produces the following errors in the TRAFCI
session.
SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
*** ERROR[29157] There was a problem reading from the server
*** ERROR[29160] The message header was not long enough
SQL>insert into josh_test values (1);
*** ERROR[29443] Database connection does not exist. Please connect to the
database by using the connect or the reconnect command.
ISSUE 2 (intermittent): Opening a new TRAFCI session and running a new SQL
statement ‘STMT_B’ produces the following errors.
*** ERROR[8448] Unable to access Hbase interface. Call to
ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause:
java.util.concurrent.ExecutionException: java.io.IOException: performScan
encountered Exception txID: 8591654594 Exception:
org.apache.hadoop.hbase.UnknownScannerException: TrxRegionEndpoint getScanner -
scanner id 0, already closed?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
. [2016-05-31 12:13:05]
For expect 2:
ISSUE 1: Executing ‘STMT_A’ produces the following errors in the TRAFCI session.
SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
*** ERROR[8448] Unable to access Hbase interface. Call to
ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause:
java.util.concurrent.ExecutionException: java.io.IOException: performScan
encountered Exception txID: 4296737298 Exception:
org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException:
TrxRegionEndpoint coprocessor: getScanner - scanner id 0, Expected nextCallSeq:
50, But the nextCallSeq received from client: 49
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
. [2016-05-31 11:30:36]
T-22
Command: sqcheck
         DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Network Broken in 1 minute
Expect   1         0         0                1                4             Return in 1 minute
Actual   0         0         0                1                4             Return in 1 minute
Network Recover in 1 minute
Expect   0         0         0                0                0             Return in 2 minutes
Actual   0         0         0                0                0             Return in 2 minutes
T-23
Command: dcscheck
         DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Network Broken in 1 minute
Expect   0                1                4             Return in 2 minutes
Actual   0                1                4             Return in 2 minutes
Network Recover in 1 minute
Expect   0                1                4             Return in 2 minutes
Actual   0                1                4             Return in 2 minutes
T-24
Command: shell -c node info Comments
Network Broken in 1 minute
Expect Only nap104 node down, other nodes up, return in 30 seconds
Actual 4 nodes up, return in 10 seconds
Network Recover in 1 minute
Expect 4 nodes up, return in 10 seconds
Actual 4 nodes up, return in 10 seconds
T-25
Command: cstat Comments
Network Broken in 1 minute
Expect Return in 30 seconds.
Actual Return in 30 seconds.
Network Recover in 1 minute
Expect Return in 30 seconds.
Actual Return in 30 seconds.
T-26
Commands: trafci Comments
Network Broken in 1 minute
Expect Login succeeds; the login attempt returns within 1 minute whether it
succeeds or fails.
Actual Login succeeds; the login attempt returns within 1 minute whether it
succeeds or fails.
Network Recover in 1 minute
Expect Login succeeds; the login attempt returns within 1 minute whether it
succeeds or fails.
Actual Login succeeds; the login attempt returns within 1 minute whether it
succeeds or fails.
T-27
HDFS Comments
Network Broken in 1 minute
Expect Only nap104 data node down, other data nodes up, 1 name node up.
Actual Only nap104 data node down, other data nodes up, 1 name node up.
Network Recover in 1 minute
Expect 4 data nodes up, 1 name node up.
Actual 4 data nodes up, 1 name node up.
T-28
HBase Comments
Network Broken in 1 minute
Expect 1 region server down, other 3 region servers up, 1 HBASE master up,
RegionServer Health Summary process reports a minor alert.
Actual 1 region server down, other 3 region servers up, 1 HBASE master up,
RegionServer Health Summary process reports a minor alert.
Network Recover in 1 minute
Expect 4 region servers up, 1 HBASE master up, no alerts.
Actual The region server process on the nap104 node is down (CRITICAL MESSAGE:
Connection failed: [Errno 111] Connection refused to nap104.esgyn.local:60030).
At the same time, the nap101 node (HBASE master node) reports the critical
message “Dead RegionServer(s): 1 out of 3” through the RegionServer Health
Summary process. 1 HBASE master up.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)