[ 
https://issues.apache.org/jira/browse/TRAFODION-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu ming reassigned TRAFODION-2070:
-----------------------------------

    Assignee: liu ming

> Trafodion cannot adjust your working status in time when network broken.
> ------------------------------------------------------------------------
>
>                 Key: TRAFODION-2070
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2070
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: dtm
>    Affects Versions: 2.0-incubating, 2.1-incubating
>            Reporter: Jarek
>            Assignee: liu ming
>
> Issue Title: Trafodion cannot adjust your working status in time when network 
> broken.
> Test Steps (including part 1 and part 2):
> Precondition: the test environment is healthy, including HDFS, HBase and 
> EsgynDB.
> Part 1: The network is broken for a long time, here limited to 15 minutes.
> Step 0. Log in to a TRAFCI session and run a SQL statement ‘STMT_A’, such as 
> an insert/delete/update/select, that keeps running for several minutes.
> Step 1. Use the command ‘iptables -I INPUT -s $NODE_HOST -j DROP’ to make the 
> nap104 node’s network unreachable for 15 minutes.
> Step 2. Check the status of Step 0, the major SQL commands, and HDFS/HBase.
>      These checks are run on the nap101, nap102 and nap103 nodes.
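
The fault injection in Step 1 and its removal in Step 3 could be wrapped so that the DROP rule is always deleted again, even if a check fails partway through. This is only a minimal sketch: the two iptables commands are the ones quoted in the report, while `drop_rule`, `network_blocked` and the injectable `run` parameter are hypothetical names introduced for illustration. It assumes the script runs as root on the node where the rule should be installed.

```python
import subprocess
from contextlib import contextmanager


def drop_rule(action, node_host):
    """Build the iptables command from Step 1 ('-I') or Step 3 ('-D')."""
    if action not in ("-I", "-D"):
        raise ValueError("action must be '-I' (insert) or '-D' (delete)")
    return ["iptables", action, "INPUT", "-s", node_host, "-j", "DROP"]


@contextmanager
def network_blocked(node_host, run=subprocess.check_call):
    """Insert the DROP rule on entry and always delete it on exit.

    `run` is injectable (an assumption of this sketch) so the commands
    can be recorded or printed instead of executed.
    """
    run(drop_rule("-I", node_host))
    try:
        yield
    finally:
        # Runs even when the body raises, so the node is never left cut off.
        run(drop_rule("-D", node_host))


# Example of a dry run that only prints the two commands:
#     with network_blocked("nap104", run=lambda cmd: print(" ".join(cmd))):
#         pass  # a real Part 1 run would sleep for 15 * 60 seconds here
```

Passing a recording function as `run` makes it possible to rehearse the procedure without touching the firewall.
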
> T-1
> Step 0        Comments
> Expect  1. When TRAFCI is connected to the nap104 node, the SQL statement
>            ‘STMT_A’ fails and TRAFCI exits normally.
>         2. When TRAFCI is connected to the nap101/nap102/nap103 node, the
>            SQL statement ‘STMT_A’ succeeds and TRAFCI exits normally.
> Actual  For expectation 1:
> ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ normally 
> displays:
> “SQL>get statistics for qid
> MXID11003025894212331445711895145000000000206U3333300_339_SQL_CUR_7;
> *** ERROR[2024] Server Process $ZSM003 is not running or could not be 
> created. Operating System Error 14 was returned. [2016-05-31 09:22:48]”
> Back in the TRAFCI session, the SQL statement ‘STMT_A’ does not return in 
> time but hangs for a long time.
> ISSUE 2: At the same time, open a new TRAFCI session connected to another 
> node, for example nap102, and run the SQL query ‘STMT_B’. Using the command 
> ‘./offender -s active’ to check the QID status of ‘STMT_B’ prints the 
> following error and reports 0 row(s) selected:
> “*** ERROR[8921] The request to obtain runtime statistics for 
> ACTIVE_QUERIES=30 timed out. Timeout period specified is 4 seconds.
> --- 0 row(s) selected.”
> ISSUE 3: Back in the TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’, both 
> sessions are interrupted with the errors below.
> “SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
> *** ERROR[29157] There was a problem reading from the server
> *** ERROR[29160] The message header was not long enough
> SQL>insert into josh_test_after values (1);
> *** ERROR[29443] Database connection does not exist. Please connect to the 
> database by using the connect or the reconnect command.”.
> For expectation 2:
> ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ normally 
> displays:
> “SQL>get statistics for qid
> MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5;+>
> Qid                      
> MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5
> Compile Start Time       2016/05/31 10:02:56.968052
> Compile End Time         2016/05/31 10:02:59.207231
> Compile Elapsed Time                 0:00:02.239179
> Execute Start Time       2016/05/31 10:02:59.207468
> Execute End Time         -1                       
> Execute Elapsed Time                 0:05:46.731948
> State                    CLOSE”
> Back in the TRAFCI session, the SQL statement ‘STMT_A’ does not return in 
> time but hangs for a long time.
> ISSUE 2: At the same time, open a new TRAFCI session connected to another 
> node, for example nap102, and run the SQL query ‘STMT_B’. Using the command 
> ‘./offender -s active’ to check the QID status of ‘STMT_B’ prints the 
> following error and reports 0 row(s) selected:
> “*** ERROR[8921] The request to obtain runtime statistics for 
> ACTIVE_QUERIES=30 timed out. Timeout period specified is 4 seconds.
> --- 0 row(s) selected.
> >>”
>       BTW, all TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’ are closed normally.
> T-2
> Command: sqcheck
>          DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Expect   1         2         0                1                4             Return in 1 minute
> Actual   1         2         0                1                4             Return in 1 minute
> T-3
> Command: dcscheck
>          DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Expect   0                1                4             Return in 1 minute
> Actual   0                1                4             Return in about 5 minutes
> T-4
> Command: shell -c node info   Comments
> Expect        Only nap104 node down, return in 30 seconds
> Actual        Only nap104 node down, return in 10 seconds
> T-5
> Command: cstat        Comments
> Expect        Return in 30 seconds. 
> Actual        Return in about 3 minutes.
> T-6
> Command: trafci       Comments
> Expect  1. Login succeeds; returns within 1 minute whether the login
>            succeeds or fails.
>         2. The new SQL statement ‘STMT_B’ runs successfully in TRAFCI and
>            TRAFCI exits normally.
> Actual  Case 1: When ‘STMT_A’ is connected to nap104, open a new TRAFCI
>         session to execute ‘STMT_B’:
>         1. Login succeeds; returns within 1 minute whether the login
>            succeeds or fails.
>         2. The new SQL statement ‘STMT_B’ fails and TRAFCI exits
>            abnormally.
>         Case 2: When ‘STMT_A’ is connected to another node, for example
>         nap102, open a new TRAFCI session to execute ‘STMT_B’:
>         1. Login sometimes succeeds, with a variable return time, and
>            sometimes fails after hanging for a long time with no message
>            (for example a timeout notice) printed.
>         2. The new SQL statement ‘STMT_B’ runs successfully and TRAFCI
>            exits normally.
> T-7
> HDFS  Comments
> Expect        Only nap104 data node down, other 3 data nodes up, 1 name node 
> up, Data Node Health Summary process reports a minor alert.
> Actual        Only nap104 data node down, other 3 data nodes up, 1 name node 
> up, Data Node Health Summary process reports a minor alert.
> T-8
> HBase Comments
> Expect        1 region server down, other 3 region servers up, 1 HBASE master 
> up, RegionServer Health Summary process reports a minor alert.
> Actual        1 region server down, other 3 region servers up, 1 HBASE master 
> up, RegionServer Health Summary process reports a minor alert.
> Step 3. After 15 minutes, make the nap104 node’s network reachable again 
> using the command ‘iptables -D INPUT -s $NODE_HOST -j DROP’.
> Step 4. Check the status of Step 0, the major SQL commands, and HDFS/HBase.
> T-11
> Command: sqcheck
>                        DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Expect on any node     0         0         0                0                0             Return in 1 minute
> Actual on nap101 node  1         2         0                1                4             Return in 1 minute
> Actual on nap104 node  3         6         0                1                16            Return in 1 minute
> T-12
> Command: dcscheck
>                        DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Expect on any node     0                0                0             Return in 1 minute
> Actual on nap101 node  0                1                4             Return in 1 minute
> Actual on nap104 node  0                1                4             Return in 1 minute
> T-13
> Command: shell -c node info   Comments
> Expect on any node    4 nodes up, return in 30 seconds
> Actual on nap101 node Only nap104 down, other nap101, nap102 and nap103 nodes 
> up, return in 30 seconds.
> Actual on nap104 node Only nap101, nap102 and nap103 nodes down, nap104 up, 
> return in 30 seconds.
> T-14
> Command: cstat        Comments
> Expect on any node    return in 30 seconds
> Actual on any node    return in 30 seconds
> T-15
> Command: trafci       Comments
> Expect on any node    1. Login succeeds; returns within 1 minute whether
>                          the login succeeds or fails.
>                       2. A new SQL statement runs successfully in TRAFCI
>                          and TRAFCI exits normally.
> Actual on any node    1. Login succeeds; returns within 1 minute whether
>                          the login succeeds or fails.
>                       2. A new SQL statement runs successfully in TRAFCI
>                          and TRAFCI exits normally.
> T-16
> HDFS  Comments
> Expect        4 data nodes up, 1 name node up, no alerts.
> Actual        4 data nodes up, 1 name node up, no alerts.
> T-17
> HBase Comments
> Expect        4 region servers up, 1 HBASE master up, no alerts.
> Actual        1 region server process on the nap104 node is down (CRITICAL 
> MESSAGE: Connection failed: [Errno 111] Connection refused to 
> nap104.esgyn.local:60030); at the same time the nap101 node (HBASE master 
> node) reports a “Dead RegionServer(s): 1 out of 3” critical message from the 
> RegionServer Health Summary process. 1 HBASE master up.
> Part 2: The network is unstable: up for 1 minute, then down for 1 minute, 
> again and again.
> Step 0. Log in to a TRAFCI session and run a SQL statement ‘STMT_A’, such as 
> an insert/delete/update/select, that keeps running for several minutes.
> Step 1. Make the nap104 node’s network unstable, then check the status of 
> Step 0, the major SQL commands, HDFS and HBase.
>            These checks are run on the nap101, nap102 and nap103 nodes.
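
The up/down pattern of Part 2 can be written down as a schedule before any firewall rule is touched. The sketch below is an illustration only: `flap_schedule` and its parameters are hypothetical names, and the report does not say how many cycles were run, so the cycle count is left to the caller. Each "block" would correspond to the `iptables -I` command and each "unblock" to the `iptables -D` command quoted in the report.

```python
def flap_schedule(cycles, up_seconds=60, down_seconds=60):
    """Return (offset_seconds, action) pairs for an unstable network.

    The network starts up, is blocked after `up_seconds`, and is
    unblocked again after a further `down_seconds`; this repeats
    `cycles` times, so the schedule always ends with the network up.
    """
    events = []
    offset = 0
    for _ in range(cycles):
        offset += up_seconds
        events.append((offset, "block"))
        offset += down_seconds
        events.append((offset, "unblock"))
    return events


# Two cycles with the 1-minute intervals described in Part 2:
for offset, action in flap_schedule(2):
    print(f"t+{offset:>4}s  {action}")
```

Keeping the schedule as plain data makes it easy to log next to the check results, so each observation can be matched to the network state at that moment.
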
> T-21
> Step 0        Comments
> Expect  1. When TRAFCI is connected to the nap104 node, the SQL statement
>            ‘STMT_A’ fails and TRAFCI exits normally.
>         2. When TRAFCI is connected to the nap101/nap102/nap103 node, the
>            SQL statement ‘STMT_A’ succeeds and TRAFCI exits normally.
> Actual  For expectation 1:
> ISSUE 1: Executing ‘STMT_A’ returns the following errors in the TRAFCI 
> session.
> SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
> *** ERROR[29157] There was a problem reading from the server
> *** ERROR[29160] The message header was not long enough
> SQL>insert into josh_test values (1);
> *** ERROR[29443] Database connection does not exist. Please connect to the 
> database by using the connect or the reconnect command.
> ISSUE 2 (occurs occasionally): Opening a new TRAFCI session and running a 
> new SQL statement ‘STMT_B’ returns the following errors.
> *** ERROR[8448] Unable to access Hbase interface. Call to 
> ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause: 
> java.util.concurrent.ExecutionException: java.io.IOException: performScan 
> encountered Exception txID: 8591654594 Exception: 
> org.apache.hadoop.hbase.UnknownScannerException: TrxRegionEndpoint getScanner 
> - scanner id 0, already closed?
> java.util.concurrent.FutureTask.report(FutureTask.java:122)
> java.util.concurrent.FutureTask.get(FutureTask.java:188)
> org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
> . [2016-05-31 12:13:05]
> For expectation 2:
> ISSUE 1: Executing ‘STMT_A’ returns the following errors in the TRAFCI 
> session.
> SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
> *** ERROR[8448] Unable to access Hbase interface. Call to 
> ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause: 
> java.util.concurrent.ExecutionException: java.io.IOException: performScan 
> encountered Exception txID: 4296737298 Exception: 
> org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: 
> TrxRegionEndpoint coprocessor: getScanner - scanner id 0, Expected 
> nextCallSeq: 50, But the nextCallSeq received from client: 49
> java.util.concurrent.FutureTask.report(FutureTask.java:122)
> java.util.concurrent.FutureTask.get(FutureTask.java:188)
> org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
> . [2016-05-31 11:30:36]
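
When comparing captured TRAFCI transcripts across runs, it can help to reduce each one to its list of error numbers. A minimal sketch follows, assuming the `*** ERROR[nnnn]` prefix seen in the output quoted above; `error_codes` is a hypothetical helper name, not part of TRAFCI.

```python
import re

# Matches the "*** ERROR[29157]" prefix that TRAFCI prints before a message.
ERROR_RE = re.compile(r"\*\*\* ERROR\[(\d+)\]")


def error_codes(transcript):
    """Return the error numbers found in a transcript, in order of appearance."""
    return [int(code) for code in ERROR_RE.findall(transcript)]


sample = (
    "SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;\n"
    "*** ERROR[29157] There was a problem reading from the server\n"
    "*** ERROR[29160] The message header was not long enough\n"
)
print(error_codes(sample))
```
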
> T-22
> Command: sqcheck
>                                       DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Network Broken in 1 minute   Expect   1         0         0                1                4             Return in 1 minute
>                              Actual   0         0         0                1                4             Return in 1 minute
> Network Recover in 1 minute  Expect   0         0         0                0                0             Return in 2 minutes
>                              Actual   0         0         0                0                0             Return in 2 minutes
> T-23
> Command: dcscheck
>                                       DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Network Broken in 1 minute   Expect   0                1                4             Return in 2 minutes
>                              Actual   0                1                4             Return in 2 minutes
> Network Recover in 1 minute  Expect   0                1                4             Return in 2 minutes
>                              Actual   0                1                4             Return in 2 minutes
> T-24
> Command: shell -c node info   Comments
> Network Broken in 1 minute
>   Expect  Only nap104 node down, other nodes up, return in 30 seconds
>   Actual  4 nodes up, return in 10 seconds
> Network Recover in 1 minute
>   Expect  4 nodes up, return in 10 seconds
>   Actual  4 nodes up, return in 10 seconds
> T-25
> Command: cstat        Comments
> Network Broken in 1 minute
>   Expect  Return in 30 seconds.
>   Actual  Return in 30 seconds.
> Network Recover in 1 minute
>   Expect  Return in 30 seconds.
>   Actual  Return in 30 seconds.
> T-26
> Command: trafci       Comments
> Network Broken in 1 minute
>   Expect  Login succeeds; returns within 1 minute whether the login
>           succeeds or fails.
>   Actual  Login succeeds; returns within 1 minute whether the login
>           succeeds or fails.
> Network Recover in 1 minute
>   Expect  Login succeeds; returns within 1 minute whether the login
>           succeeds or fails.
>   Actual  Login succeeds; returns within 1 minute whether the login
>           succeeds or fails.
> T-27
> HDFS  Comments
> Network Broken in 1 minute
>   Expect  Only nap104 data node down, other data nodes up, 1 name node up.
>   Actual  Only nap104 data node down, other data nodes up, 1 name node up.
> Network Recover in 1 minute
>   Expect  4 data nodes up, 1 name node up.
>   Actual  4 data nodes up, 1 name node up.
> T-28
> HBase Comments
> Network Broken in 1 minute
>   Expect  1 region server down, other 3 region servers up, 1 HBASE master
>           up, RegionServer Health Summary process reports a minor alert.
>   Actual  1 region server down, other 3 region servers up, 1 HBASE master
>           up, RegionServer Health Summary process reports a minor alert.
> Network Recover in 1 minute
>   Expect  4 region servers up, 1 HBASE master up, no alerts.
>   Actual  1 region server process on the nap104 node is down (CRITICAL
>           MESSAGE: Connection failed: [Errno 111] Connection refused to
>           nap104.esgyn.local:60030); at the same time the nap101 node
>           (HBASE master node) reports a “Dead RegionServer(s): 1 out of 3”
>           critical message from the RegionServer Health Summary process.
>           1 HBASE master up.
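
Many of the checks above are stated as time limits ("Return in 1 minute", "Return in 30 seconds"). One way to make those limits repeatable is to run each command under a timeout and record whether it came back in time. This is a rough sketch rather than the procedure used in the report: `timed_check` is a hypothetical helper, and the example invocations in the comments assume the commands are on the PATH and take the arguments shown in the tables.

```python
import subprocess
import sys
import time


def timed_check(cmd, limit_seconds):
    """Run `cmd` and report whether it returned within `limit_seconds`.

    On timeout the child process is killed and `returncode` is None.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=limit_seconds
        )
        on_time, returncode = True, proc.returncode
    except subprocess.TimeoutExpired:
        on_time, returncode = False, None
    elapsed = time.monotonic() - start
    return {"on_time": on_time, "returncode": returncode, "elapsed": elapsed}


# Limits taken from the Expect rows above (command names as in the report):
#     timed_check(["sqcheck"], 60)
#     timed_check(["dcscheck"], 60)
#     timed_check(["cstat"], 30)

# Self-contained demonstration using the Python interpreter itself:
print(timed_check([sys.executable, "-c", "pass"], 30)["on_time"])
```
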



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
