[
https://issues.apache.org/jira/browse/TRAFODION-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liu ming closed TRAFODION-2070.
-------------------------------
Resolution: Cannot Reproduce
Fix Version/s: 2.3
> Trafodion cannot adjust its working status in time when the network is broken.
> ------------------------------------------------------------------------
>
> Key: TRAFODION-2070
> URL: https://issues.apache.org/jira/browse/TRAFODION-2070
> Project: Apache Trafodion
> Issue Type: Bug
> Components: dtm
> Affects Versions: 2.0-incubating, 2.1-incubating
> Reporter: Jarek
> Assignee: liu ming
> Priority: Major
> Fix For: 2.3
>
>
> Issue Title: Trafodion cannot adjust its working status in time when the
> network is broken.
> Test steps (including Part 1 and Part 2):
> Precondition: the test environment is healthy, including HDFS, HBase and
> EsgynDB.
> Part 1: The network is broken for a long time; here we limit it to 15 minutes.
> Step 0. Log in to a TRAFCI interface and run a SQL statement ‘STMT_A’ (an
> insert/delete/update/select); the SQL statement keeps running for several
> minutes.
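> For illustration only, a minimal sketch of Step 0; the connection prompts are
> omitted, and the statement shown is simply the long-running query quoted later
> in this report, standing in for ‘STMT_A’:
>
>   $ trafci
>   SQL> select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;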
> Step 1. Use the command ‘iptables -I INPUT -s $NODE_HOST -j DROP’ to make the
> nap104 node’s network unreachable for 15 minutes.
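> A minimal sketch of this step (and of the matching restore in Step 3),
> assuming $NODE_HOST holds the address whose traffic should be dropped and that
> the commands are run with root privileges; the placeholder address below is
> not taken from the report:
>
>   #!/bin/bash
>   # Sketch: break the network path for 15 minutes, then restore it (Step 3).
>   NODE_HOST=${NODE_HOST:-192.0.2.104}         # placeholder address, not from the report
>   iptables -I INPUT -s "$NODE_HOST" -j DROP   # start dropping traffic
>   sleep $((15 * 60))                          # keep the network broken for 15 minutes
>   iptables -D INPUT -s "$NODE_HOST" -j DROP   # remove the rule; traffic flows again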
> Step 2. Check the status of Step 0, the major SQL commands, and HDFS/HBase.
> Here, the checks of Step 0 and the major SQL commands should be run on the
> nap101, nap102 and nap103 nodes.
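> Since the checks in T-2 through T-5 are mostly judged by how long each command
> takes to return, a small timing wrapper can help; this is only a sketch, using
> the command names exactly as they appear in this report and an arbitrary
> 120-second cap that is not part of the test criteria:
>
>   #!/bin/bash
>   # Sketch: time each health check; output goes to a scratch file and is not inspected here.
>   for cmd in 'sqcheck' 'dcscheck' 'cstat' 'shell -c node info'; do
>       start=$SECONDS
>       timeout 120 bash -c "$cmd" > /tmp/healthcheck.$$.log 2>&1
>       echo "'$cmd' returned in $((SECONDS - start)) seconds (exit status $?)"
>   done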
> T-1 (Step 0)
> Expect:
> 1. When TRAFCI is connected to the nap104 node, the SQL statement ‘STMT_A’
> fails and TRAFCI exits normally.
> 2. When TRAFCI is connected to a nap101/nap102/nap103 node, the SQL statement
> ‘STMT_A’ succeeds and TRAFCI exits normally.
> Actual, for Expect 1:
> ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ displays
> “SQL>get statistics for qid
> MXID11003025894212331445711895145000000000206U3333300_339_SQL_CUR_7;
> *** ERROR[2024] Server Process $ZSM003 is not running or could not be
> created. Operating System Error 14 was returned. [2016-05-31 09:22:48]”.
> Back in the TRAFCI interface, the SQL statement ‘STMT_A’ does not return in
> time but hangs for a long time.
> ISSUE 2: At the same time, open a new TRAFCI session connected to another
> node, for example nap102, and run the SQL query statement ‘STMT_B’. Using the
> command ‘./offender -s active’ to check the QID status of the SQL statement
> ‘STMT_B’ prints the following error and reports 0 row(s) selected:
> “*** ERROR[8921] The request to obtain runtime statistics for
> ACTIVE_QUERIES=30 timed out. Timeout period specified is 4 seconds.
> --- 0 row(s) selected.”
> ISSUE 3: Back in the TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’, we can see
> that these TRAFCI sessions are interrupted with the errors below:
> “SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
> *** ERROR[29157] There was a problem reading from the server
> *** ERROR[29160] The message header was not long enough
> SQL>insert into josh_test_after values (1);
> *** ERROR[29443] Database connection does not exist. Please connect to the
> database by using the connect or the reconnect command.”.
> For Expect 2:
> ISSUE 1: Checking the QID status of the SQL statement ‘STMT_A’ displays
> “SQL>get statistics for qid
> MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5;+>
> Qid
> MXID11002036628212331448817878591000000000206U3333300_325_SQL_CUR_5
> Compile Start Time 2016/05/31 10:02:56.968052
> Compile End Time 2016/05/31 10:02:59.207231
> Compile Elapsed Time 0:00:02.239179
> Execute Start Time 2016/05/31 10:02:59.207468
> Execute End Time -1
> Execute Elapsed Time 0:05:46.731948
> State CLOSE”.
> Back in the TRAFCI interface, the SQL statement ‘STMT_A’ does not return in
> time but hangs for a long time.
> ISSUE 2: At the same time, open a new TRAFCI session connected to another
> node, for example nap102, and run the SQL query statement ‘STMT_B’. Using the
> command ‘./offender -s active’ to check the QID status of the SQL statement
> ‘STMT_B’ prints the following error and reports 0 row(s) selected:
> “*** ERROR[8921] The request to obtain runtime statistics for
> ACTIVE_QUERIES=30 timed out. Timeout period specified is 4 seconds.
> --- 0 row(s) selected.
> >>”
> BTW, all TRAFCI sessions of ‘STMT_A’ and ‘STMT_B’ are closed normally.
> T-2 (Command: sqcheck)
>         DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Expect  1         2         0                1                4             Return in 1 minute
> Actual  1         2         0                1                4             Return in 1 minute
> T-3 (Command: dcscheck)
>         DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Expect  0                1                4             Return in 1 minute
> Actual  0                1                4             Return in about 5 minutes
> T-4 (Command: shell -c node info)
> Expect: Only the nap104 node down; return in 30 seconds.
> Actual: Only the nap104 node down; return in 10 seconds.
> T-5 (Command: cstat)
> Expect: Return in 30 seconds.
> Actual: Return in about 3 minutes.
> T-6 (Command: trafci)
> Expect:
> 1. Login succeeds; it returns within 1 minute whether login succeeds or fails.
> 2. The new SQL statement ‘STMT_B’ runs successfully in TRAFCI and TRAFCI
> exits normally.
> Actual:
> Case 1: When ‘STMT_A’ is connected to nap104 and a new TRAFCI session is
> opened to execute ‘STMT_B’:
> 1. Login succeeds; it returns within 1 minute whether login succeeds or fails.
> 2. The new SQL statement ‘STMT_B’ fails and TRAFCI exits abnormally.
> Case 2: When ‘STMT_A’ is connected to another node, for example nap102, and a
> new TRAFCI session is opened to execute ‘STMT_B’:
> 1. Login sometimes succeeds but the return time varies; login sometimes fails
> by hanging for a long time with no message printed, for example a timeout hint.
> 2. The new SQL statement ‘STMT_B’ succeeds and TRAFCI exits normally.
> T-7 (HDFS)
> Expect: Only the nap104 data node down, the other 3 data nodes up, 1 name node
> up; the Data Node Health Summary process reports a minor alert.
> Actual: Only the nap104 data node down, the other 3 data nodes up, 1 name node
> up; the Data Node Health Summary process reports a minor alert.
> T-8 (HBase)
> Expect: 1 region server down, the other 3 region servers up, 1 HBase master
> up; the RegionServer Health Summary process reports a minor alert.
> Actual: 1 region server down, the other 3 region servers up, 1 HBase master
> up; the RegionServer Health Summary process reports a minor alert.
> Step 3. After 15 minutes, make the nap104 node’s network reachable again using
> the command ‘iptables -D INPUT -s $NODE_HOST -j DROP’.
> Step 4. Check the status of Step 0, the major SQL commands, and HDFS/HBase.
> T-11 (Command: sqcheck)
>                        DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Expect on any node     0         0         0                0                0             Return in 1 minute
> Actual on nap101 node  1         2         0                1                4             Return in 1 minute
> Actual on nap104 node  3         6         0                1                16            Return in 1 minute
> T-12 (Command: dcscheck)
>                        DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Expect on any node     0                0                0             Return in 1 minute
> Actual on nap101 node  0                1                4             Return in 1 minute
> Actual on nap104 node  0                1                4             Return in 1 minute
> T-13 (Command: shell -c node info)
> Expect on any node: 4 nodes up; return in 30 seconds.
> Actual on nap101 node: Only nap104 down; nap101, nap102 and nap103 up; return
> in 30 seconds.
> Actual on nap104 node: Only nap101, nap102 and nap103 down; nap104 up; return
> in 30 seconds.
> T-14 (Command: cstat)
> Expect on any node: Return in 30 seconds.
> Actual on any node: Return in 30 seconds.
> T-15 (Command: trafci)
> Expect on any node:
> 1. Login succeeds; it returns within 1 minute whether login succeeds or fails.
> 2. A new SQL statement runs successfully in TRAFCI and TRAFCI exits normally.
> Actual on any node:
> 1. Login succeeds; it returns within 1 minute whether login succeeds or fails.
> 2. A new SQL statement runs successfully in TRAFCI and TRAFCI exits normally.
> T-16 (HDFS)
> Expect: 4 data nodes up, 1 name node up, no alerts.
> Actual: 4 data nodes up, 1 name node up, no alerts.
> T-17 (HBase)
> Expect: 4 region servers up, 1 HBase master up, no alerts.
> Actual: The region server process on the nap104 node is down (CRITICAL
> MESSAGE: Connection failed: [Errno 111] Connection refused to
> nap104.esgyn.local:60030); at the same time the nap101 node (HBase master
> node) reports the “Dead RegionServer(s): 1 out of 3” critical message via the
> RegionServer Health Summary process. 1 HBase master up.
> Part 2: The network is unstable: up for 1 minute and down for the next
> minute, repeatedly.
> Step 0. Log in to a TRAFCI interface and run a SQL statement ‘STMT_A’ (an
> insert/delete/update/select); the SQL statement keeps running for several
> minutes.
> Step 1. Make the nap104 node’s network unstable, then check the status of
> Step 0, the major SQL commands, and HDFS/HBase.
> Here, the checks of Step 0 and the major SQL commands should be run on the
> nap101, nap102 and nap103 nodes.
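> A minimal sketch of this flapping pattern, under the same assumptions as the
> Step 1 sketch in Part 1 ($NODE_HOST is the address whose traffic is dropped,
> root privileges); the cycle count is a placeholder, not a value from the
> report:
>
>   #!/bin/bash
>   # Sketch: alternate 1 minute of broken network with 1 minute of recovery.
>   NODE_HOST=${NODE_HOST:-192.0.2.104}          # placeholder address, not from the report
>   CYCLES=${CYCLES:-15}                         # placeholder number of up/down cycles
>   for ((i = 0; i < CYCLES; i++)); do
>       iptables -I INPUT -s "$NODE_HOST" -j DROP   # network down for 1 minute
>       sleep 60
>       iptables -D INPUT -s "$NODE_HOST" -j DROP   # network back up for 1 minute
>       sleep 60
>   done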
> T-21 (Step 0)
> Expect:
> 1. When TRAFCI is connected to the nap104 node, the SQL statement ‘STMT_A’
> fails and TRAFCI exits normally.
> 2. When TRAFCI is connected to a nap101/nap102/nap103 node, the SQL statement
> ‘STMT_A’ succeeds and TRAFCI exits normally.
> Actual, for Expect 1:
> ISSUE 1: Executing ‘STMT_A’ produces the following errors in the TRAFCI
> interface:
> SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
> *** ERROR[29157] There was a problem reading from the server
> *** ERROR[29160] The message header was not long enough
> SQL>insert into josh_test values (1);
> *** ERROR[29443] Database connection does not exist. Please connect to the
> database by using the connect or the reconnect command.
> ISSUE 2 (intermittent): Opening a new TRAFCI session and running a new SQL
> statement ‘STMT_B’ produces the following errors:
> *** ERROR[8448] Unable to access Hbase interface. Call to
> ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause:
> java.util.concurrent.ExecutionException: java.io.IOException: performScan
> encountered Exception txID: 8591654594 Exception:
> org.apache.hadoop.hbase.UnknownScannerException: TrxRegionEndpoint getScanner
> - scanner id 0, already closed?
> java.util.concurrent.FutureTask.report(FutureTask.java:122)
> java.util.concurrent.FutureTask.get(FutureTask.java:188)
> org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
> . [2016-05-31 12:13:05]
> For Expect 2:
> ISSUE 1: Executing ‘STMT_A’ produces the following errors in the TRAFCI
> interface:
> SQL>select [last 1] * from TRAFODION.JAVABENCH.YCSB_TABLE_20;
> *** ERROR[8448] Unable to access Hbase interface. Call to
> ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-706). Cause:
> java.util.concurrent.ExecutionException: java.io.IOException: performScan
> encountered Exception txID: 4296737298 Exception:
> org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException:
> TrxRegionEndpoint coprocessor: getScanner - scanner id 0, Expected
> nextCallSeq: 50, But the nextCallSeq received from client: 49
> java.util.concurrent.FutureTask.report(FutureTask.java:122)
> java.util.concurrent.FutureTask.get(FutureTask.java:188)
> org.trafodion.sql.HTableClient.fetchRows(HTableClient.java:1258)
> . [2016-05-31 11:30:36]
> T-22 (Command: sqcheck)
>           DTM Down  RMS Down  DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Network broken for 1 minute:
>   Expect  1         0         0                1                4             Return in 1 minute
>   Actual  0         0         0                1                4             Return in 1 minute
> Network recovered for 1 minute:
>   Expect  0         0         0                0                0             Return in 2 minutes
>   Actual  0         0         0                0                0             Return in 2 minutes
> T-23 (Command: dcscheck)
>           DCS Master Down  DCS Server Down  MxoSrvr Down  Comments
> Network broken for 1 minute:
>   Expect  0                1                4             Return in 2 minutes
>   Actual  0                1                4             Return in 2 minutes
> Network recovered for 1 minute:
>   Expect  0                1                4             Return in 2 minutes
>   Actual  0                1                4             Return in 2 minutes
> T-24 (Command: shell -c node info)
> Network broken for 1 minute:
>   Expect: Only the nap104 node down, other nodes up; return in 30 seconds.
>   Actual: 4 nodes up; return in 10 seconds.
> Network recovered for 1 minute:
>   Expect: 4 nodes up; return in 10 seconds.
>   Actual: 4 nodes up; return in 10 seconds.
> T-25 (Command: cstat)
> Network broken for 1 minute:
>   Expect: Return in 30 seconds.
>   Actual: Return in 30 seconds.
> Network recovered for 1 minute:
>   Expect: Return in 30 seconds.
>   Actual: Return in 30 seconds.
> T-26 (Command: trafci)
> Network broken for 1 minute:
>   Expect: Login succeeds; it returns within 1 minute whether login succeeds
>   or fails.
>   Actual: Login succeeds; it returns within 1 minute whether login succeeds
>   or fails.
> Network recovered for 1 minute:
>   Expect: Login succeeds; it returns within 1 minute whether login succeeds
>   or fails.
>   Actual: Login succeeds; it returns within 1 minute whether login succeeds
>   or fails.
> T-27 (HDFS)
> Network broken for 1 minute:
>   Expect: Only the nap104 data node down, other data nodes up, 1 name node up.
>   Actual: Only the nap104 data node down, other data nodes up, 1 name node up.
> Network recovered for 1 minute:
>   Expect: 4 data nodes up, 1 name node up.
>   Actual: 4 data nodes up, 1 name node up.
> T-28 (HBase)
> Network broken for 1 minute:
>   Expect: 1 region server down, the other 3 region servers up, 1 HBase master
>   up; the RegionServer Health Summary process reports a minor alert.
>   Actual: 1 region server down, the other 3 region servers up, 1 HBase master
>   up; the RegionServer Health Summary process reports a minor alert.
> Network recovered for 1 minute:
>   Expect: 4 region servers up, 1 HBase master up, no alerts.
>   Actual: The region server process on the nap104 node is down (CRITICAL
>   MESSAGE: Connection failed: [Errno 111] Connection refused to
>   nap104.esgyn.local:60030); at the same time the nap101 node (HBase master
>   node) reports the “Dead RegionServer(s): 1 out of 3” critical message via
>   the RegionServer Health Summary process. 1 HBase master up.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)