[jira] [Created] (TRAFODION-2070) Trafodion cannot adjust your working status in time when network broken.

Jarek (JIRA) Thu, 16 Jun 2016 23:02:30 -0700

Jarek created TRAFODION-2070:
--------------------------------

             Summary: Trafodion cannot adjust your working status in time when 
network broken.
                 Key: TRAFODION-2070
                 URL: https://issues.apache.org/jira/browse/TRAFODION-2070
             Project: Apache Trafodion
          Issue Type: Bug
          Components: dtm
    Affects Versions: 2.0-incubating, 2.1-incubating
            Reporter: Jarek



Issue Title: Trafodion cannot adjust your working status in time when network 
broken.

Test Steps (Part 1 and Part 2):
Precondition: the test environment is healthy, including HDFS, HBase and EsgynDB.
Part 1: The network is down for a long time, limited here to 15 minutes.
Step 0. Log in to a TRAFCI session and run a SQL statement ‘STMT_A’ (for example 
insert/delete/update/select) that runs for several minutes.
Step 1. Use the command ‘iptables -I INPUT -s $NODE_HOST -j DROP’ to make the 
nap104 node’s network unreachable for 15 minutes.
Step 2. Check the status of Step 0, the major SQL commands and HDFS/HBase.
     The checks for Step 0 and the major SQL commands are run on the nap101, 
nap102 and nap103 nodes.
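The outage in Step 1, together with the matching ‘iptables -D’ command used later to restore the network, can be sketched as one small script. This is only an illustration under assumptions: the helper names (run, block_node, unblock_node, simulate_outage) and the DRY_RUN switch are hypothetical, the report does not say on which host the rule is installed, and the real commands need root privileges.

```shell
#!/bin/sh
# Sketch only: drop traffic from one host for a fixed time, then restore it.
# Helper names and DRY_RUN are illustrative, not part of Trafodion.
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Insert the DROP rule used in Step 1 of the report.
block_node() {
    run iptables -I INPUT -s "$1" -j DROP
}

# Delete the same rule again (the 'iptables -D' form from the report).
unblock_node() {
    run iptables -D INPUT -s "$1" -j DROP
}

# simulate_outage NODE_HOST SECONDS
simulate_outage() {
    outage_host=$1
    outage_seconds=$2
    block_node "$outage_host"
    run sleep "$outage_seconds"
    unblock_node "$outage_host"
}
```

For the 15-minute case described above, a call such as ‘simulate_outage "$NODE_HOST" 900’ would apply; running it first with DRY_RUN=1 shows the commands without touching the firewall.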
T-1
Step 0  Comments
Expect  1.      When TRAFCI is connected to the nap104 node, the SQL statement 
‘STMT_A’ fails and TRAFCI exits normally.
        2.      When TRAFCI is connected to the nap101/nap102/nap103 nodes, the 
SQL statement ‘STMT_A’ succeeds and TRAFCI exits normally.
Actual  For expect 1: 
ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ typically 
displays the following error:
SQL>get statistics for qid    
MXID11003025894212331445711895145000000000206U3333300_339_SQL_CUR_7;

*** ERROR[2024] Server Process $ZSM003 is not running or could not be created. 
Operating System Error 14 was returned. [2016-05-31 09:22:48]

Back in the TRAFCI session, the SQL statement ‘STMT_A’ does not return in time 
but hangs for a long time.

ISSUE 2: At the same time, open a new TRAFCI session connected to another node, 
for example nap102, and run a SQL query ‘STMT_B’. Checking the QID status of 
‘STMT_B’ with the command ‘./offender -s active’ prints the following error and 
reports 0 rows selected:
*** ERROR[8921] The request to obtain runtime statistics for ACTIVE_QUERIES=30 
timed out. Timeout period specified is 4 seconds.

--- 0 row(s) selected.
ISSUE 3: Back in the TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’, the sessions are 
interrupted with the errors below.
SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;

*** ERROR[29157] There was a problem reading from the server
*** ERROR[29160] The message header was not long enough

SQL>insert into josh_test_after values (1);

*** ERROR[29443] Database connection does not exist. Please connect to the 
database by using the connect or the reconnect command.

For expect 2:
ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ typically 
displays the following:
SQL>get statistics for qid
MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5;+>

Qid                      
MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5
Compile Start Time       2016/05/31 10:02:56.968052
Compile End Time         2016/05/31 10:02:59.207231
Compile Elapsed Time                 0:00:02.239179
Execute Start Time       2016/05/31 10:02:59.207468
Execute End Time         -1                       
Execute Elapsed Time                 0:05:46.731948
State                    CLOSE

Back in the TRAFCI session, the SQL statement ‘STMT_A’ does not return in time 
but hangs for a long time.
ISSUE 2: At the same time, open a new TRAFCI session connected to another node, 
for example nap102, and run a SQL query ‘STMT_B’. Checking the QID status of 
‘STMT_B’ with the command ‘./offender -s active’ prints the following error and 
reports 0 rows selected:
*** ERROR[8921] The request to obtain runtime statistics for ACTIVE_QUERIES=30 
timed out. Timeout period specified is 4 seconds.

--- 0 row(s) selected.
>>
        Note: all TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’ are closed normally.

T-2
Command: sqcheck

        DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Expect  1         2         0                1                4             Return in 1 minute
Actual  1         2         0                1                4             Return in 1 minute

T-3
Command: dcscheck

        DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Expect  0                1                4             Return in 1 minute
Actual  0                1                4             Return in about 5 minutes

T-4
Command: shell -c node info     Comments
Expect  Only nap104 node down, return in 30 seconds
Actual  Only nap104 node down, return in 10 seconds

T-5
Command: cstat  Comments
Expect  Return in 30 seconds. 
Actual  Return in about 3 minutes.
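The "Return in N" expectations in the tables above could be checked mechanically with a small timing wrapper. The sketch below assumes GNU coreutils `timeout` is available (it exits with status 124 when the limit is reached); timed_check is a hypothetical helper name, not a Trafodion tool.

```shell
#!/bin/sh
# Sketch only: run a check command under a time limit and report the elapsed time.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the limit is hit.
# A command that itself exits with 124 would be misreported as a timeout.

# timed_check LIMIT_SECONDS COMMAND [ARGS...]
timed_check() {
    limit=$1
    shift
    start=$(date +%s)
    status=0
    # Output of the checked command is discarded; only timing is reported.
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    elapsed=$(( $(date +%s) - start ))
    if [ "$status" -eq 124 ]; then
        echo "TIMEOUT after ${limit}s: $*"
        return 1
    fi
    echo "returned in ${elapsed}s (limit ${limit}s, exit status ${status}): $*"
}
```

With the limits listed in this report, the calls would be along the lines of ‘timed_check 60 sqcheck’, ‘timed_check 60 dcscheck’ and ‘timed_check 30 cstat’.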

T-6
Command: trafci Comments
Expect  1.      Login succeeds; the login attempt returns within 1 minute 
whether it succeeds or fails.
        2.      The new SQL statement ‘STMT_B’ runs successfully in TRAFCI and 
TRAFCI exits normally.
Actual  Case 1: ‘STMT_A’ is connected to nap104 and a new TRAFCI session is 
opened to execute ‘STMT_B’.
        1.      Login succeeds; the login attempt returns within 1 minute 
whether it succeeds or fails.
        2.      The new SQL statement ‘STMT_B’ fails and TRAFCI exits abnormally.
        Case 2: ‘STMT_A’ is connected to another node, for example nap102, and 
a new TRAFCI session is opened to execute ‘STMT_B’.
        1.      Login sometimes succeeds, with a return time that varies, and 
sometimes fails after hanging for a long time with no message printed (for 
example, no timeout notice).
        2.      The new SQL statement ‘STMT_B’ runs successfully and TRAFCI 
exits normally.

T-7
HDFS    Comments
Expect  Only nap104 data node down, other 3 data nodes up, 1 name node up, Data 
Node Health Summary process reports a minor alert.
Actual  Only nap104 data node down, other 3 data nodes up, 1 name node up, Data 
Node Health Summary process reports a minor alert.

T-8
HBase   Comments
Expect  1 region server down, other 3 region servers up, 1 HBASE master up, 
RegionServer Health Summary process reports a minor alert.
Actual  1 region server down, other 3 region servers up, 1 HBASE master up, 
RegionServer Health Summary process reports a minor alert.

Step 3. After 15 minutes, make the nap104 node’s network reachable again using 
the command ‘iptables -D INPUT -s $NODE_HOST -j DROP’.
Step 4. Check the status of Step 0, the major SQL commands and HDFS/HBase again.

T-11
Command: sqcheck

                       DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Expect on any node     0         0         0                0                0             Return in 1 minute
Actual on nap101 node  1         2         0                1                4             Return in 1 minute
Actual on nap104 node  3         6         0                1                16            Return in 1 minute

T-12
Command: dcscheck

                       DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Expect on any node     0                0                0             Return in 1 minute
Actual on nap101 node  0                1                4             Return in 1 minute
Actual on nap104 node  0                1                4             Return in 1 minute

T-13
Command: shell -c node info     Comments
Expect on any node      4 nodes up, return in 30 seconds
Actual on nap101 node   Only nap104 down, other nap101, nap102 and nap103 nodes 
up, return in 30 seconds.
Actual on nap104 node   Only nap101, nap102 and nap103 nodes down, nap104 up, 
return in 30 seconds.

T-14
Command: cstat  Comments
Expect on any node      return in 30 seconds
Actual on any node      return in 30 seconds

T-15
Command: trafci Comments
Expect on any node      1.      Login succeeds; the login attempt returns 
within 1 minute whether it succeeds or fails.
2.      A new SQL statement runs successfully in TRAFCI and TRAFCI exits normally.
Actual on any node      1.      Login succeeds; the login attempt returns 
within 1 minute whether it succeeds or fails.
2.      A new SQL statement runs successfully in TRAFCI and TRAFCI exits normally.

T-16
HDFS    Comments
Expect  4 data nodes up, 1 name node up, no alerts.
Actual  4 data nodes up, 1 name node up, no alerts.

T-17
HBase   Comments
Expect  4 region servers up, 1 HBASE master up, no alerts.
Actual  The region server process on the nap104 node is down (CRITICAL MESSAGE: 
Connection failed: [Errno 111] Connection refused to nap104.esgyn.local:60030). 
At the same time, the RegionServer Health Summary process on the nap101 node 
(HBASE master node) reports the critical message “Dead RegionServer(s): 1 out 
of 3”. 1 HBASE master up.

Part 2: The network is unstable: up for 1 minute, then down for 1 minute, 
repeatedly.
Step 0. Log in to a TRAFCI session and run a SQL statement ‘STMT_A’ (for example 
insert/delete/update/select) that runs for several minutes.
Step 1. Make the nap104 node’s network unstable, then check the status of 
Step 0, the major SQL commands, HDFS and HBase.
           The checks for Step 0 and the major SQL commands are run on the 
nap101, nap102 and nap103 nodes.
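The report does not say how the network was made unstable in Step 1. One possible way, sketched here purely as an assumption, is to alternate the same iptables DROP rule from Part 1 on and off at a fixed period. The names run and flap_network and the DRY_RUN switch are hypothetical, and the real commands need root privileges.

```shell
#!/bin/sh
# Sketch only: alternate "network up" and "network down" phases of equal length.
# This is an assumed method; the report does not describe how it was done.
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# flap_network NODE_HOST CYCLES [PERIOD_SECONDS]
# Each cycle: stay up for PERIOD, block, stay down for PERIOD, unblock.
flap_network() {
    flap_host=$1
    flap_cycles=$2
    flap_period=${3:-60}
    flap_i=0
    while [ "$flap_i" -lt "$flap_cycles" ]; do
        run sleep "$flap_period"
        run iptables -I INPUT -s "$flap_host" -j DROP
        run sleep "$flap_period"
        run iptables -D INPUT -s "$flap_host" -j DROP
        flap_i=$((flap_i + 1))
    done
}
```

A call such as ‘flap_network "$NODE_HOST" 10 60’ would give ten cycles of one minute up and one minute down; the loop always ends with the rule removed, so the node is left reachable.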
T-21
Step 0  Comments
Expect  1.      When TRAFCI is connected to the nap104 node, the SQL statement 
‘STMT_A’ fails and TRAFCI exits normally.
        2.      When TRAFCI is connected to the nap101/nap102/nap103 nodes, the 
SQL statement ‘STMT_A’ succeeds and TRAFCI exits normally.
Actual  For expect 1:
      ISSUE 1: Executing ‘STMT_A’ produces the following errors in the TRAFCI 
session.
SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;

*** ERROR[29157] There was a problem reading from the server
*** ERROR[29160] The message header was not long enough

SQL>insert into josh_test values (1);

*** ERROR[29443] Database connection does not exist. Please connect to the 
database by using the connect or the reconnect command.
ISSUE 2 (intermittent): Opening a new TRAFCI session to run a new SQL statement 
‘STMT_B’ produces the following errors.
*** ERROR[8448] Unable to access Hbase interface. Call to 
ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause: 
java.util.concurrent.ExecutionException: java.io.IOException: performScan 
encountered Exception txID: 8591654594 Exception: 
org.apache.hadoop.hbase.UnknownScannerException: TrxRegionEndpoint getScanner - 
scanner id 0, already closed?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
. [2016-05-31 12:13:05]
For expect 2: 
ISSUE 1: Executing ‘STMT_A’ produces the following errors in the TRAFCI session.
SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;

*** ERROR[8448] Unable to access Hbase interface. Call to 
ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause: 
java.util.concurrent.ExecutionException: java.io.IOException: performScan 
encountered Exception txID: 4296737298 Exception: 
org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: 
TrxRegionEndpoint coprocessor: getScanner - scanner id 0, Expected nextCallSeq: 
50, But the nextCallSeq received from client: 49
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
. [2016-05-31 11:30:36]

T-22
Command: sqcheck

                                     DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Network Broken in 1 minute   Expect  1         0         0                1                4             Return in 1 minute
                             Actual  0         0         0                1                4             Return in 1 minute
Network Recover in 1 minute  Expect  0         0         0                0                0             Return in 2 minutes
                             Actual  0         0         0                0                0             Return in 2 minutes

T-23
Command: dcscheck

                                     DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
Network Broken in 1 minute   Expect  0                1                4             Return in 2 minutes
                             Actual  0                1                4             Return in 2 minutes
Network Recover in 1 minute  Expect  0                1                4             Return in 2 minutes
                             Actual  0                1                4             Return in 2 minutes

T-24
        Command: shell -c node info     Comments
Network Broken in 1 minute      Expect  Only nap104 node down, other nodes up, 
return in 30 seconds
        Actual  4 nodes up, return in 10 seconds
Network Recover in 1 minute     Expect  4 nodes up, return in 10 seconds
        Actual  4 nodes up, return in 10 seconds

T-25
        Command: cstat  Comments
Network Broken in 1 minute      Expect  Return in 30 seconds. 
        Actual  Return in 30 seconds.
Network Recover in 1 minute     Expect  Return in 30 seconds. 
        Actual  Return in 30 seconds.

T-26
        Commands: trafci        Comments
Network Broken in 1 minute      Expect  Login succeeds; the login attempt 
returns within 1 minute whether it succeeds or fails.
        Actual  Login succeeds; the login attempt returns within 1 minute 
whether it succeeds or fails.
Network Recover in 1 minute     Expect  Login succeeds; the login attempt 
returns within 1 minute whether it succeeds or fails.
        Actual  Login succeeds; the login attempt returns within 1 minute 
whether it succeeds or fails.

T-27
        HDFS    Comments
Network Broken in 1 minute      Expect  Only nap104 data node down, other data 
nodes up, 1 name node up.
        Actual  Only nap104 data node down, other data nodes up, 1 name node up.
Network Recover in 1 minute     Expect  4 data nodes up, 1 name node up.
        Actual  4 data nodes up, 1 name node up.

T-28
        HBase   Comments
Network Broken in 1 minute      Expect  1 region server down, other 3 region 
servers up, 1 HBASE master up, RegionServer Health Summary process reports a 
minor alert.
        Actual  1 region server down, other 3 region servers up, 1 HBASE master 
up, RegionServer Health Summary process reports a minor alert.
Network Recover in 1 minute     Expect  4 region servers up, 1 HBASE master up, 
no alerts.
        Actual  The region server process on the nap104 node is down (CRITICAL 
MESSAGE: Connection failed: [Errno 111] Connection refused to 
nap104.esgyn.local:60030). At the same time, the RegionServer Health Summary 
process on the nap101 node (HBASE master node) reports the critical message 
“Dead RegionServer(s): 1 out of 3”. 1 HBASE master up.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
