[jira] [Updated] (TRAFODION-2664) Instance will be down when the zookeeper on name node has been down

Jarek (JIRA) Wed, 21 Jun 2017 19:28:11 -0700

     [ https://issues.apache.org/jira/browse/TRAFODION-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jarek updated TRAFODION-2664:
-----------------------------
    Description: 
Description: The whole instance goes down after ZooKeeper on the name node is stopped.
Test Steps:
Step 1. Start OE and 4 long queries with trafci on the first node, esggy-clu-n010.
Step 2. Wait several minutes, then stop ZooKeeper on the name node 
(esggy-clu-n010) from the Cloudera Manager page.
Step 3. With trafci, run a basic query and the 4 long queries again.

During Step 3, the whole instance goes down after a while. I repeated this 
test scenario several times, and the instance went down every time.

Timestamp:
Test Start Time: 20170616132939
Test End  Time: 20170616134350
Stop zookeeper on name node of node esggy-clu-n010: 20170616133344

Check logs:
1) Each node displays the following error:
2017-06-16 13:33:46,276, ERROR, MON, Node Number: 0,, PIN: 5017 , Process Name: 
$MONITOR,,, TID: 5429, Message ID: 101371801, [CZClient::IsZNodeExpired], 
zoo_exists() for /trafodion/instance/cluster/esggy-clu-n010.esgyn.cn failed 
with error ZCONNECTIONLOSS
2) Zookeeper displays:
ls /trafodion/instance/cluster
[]
So it seems that the zclient has been lost on each node.
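
For triage, the MON error line quoted above can be picked apart with a short script. This is only a sketch: the field layout is inferred from the single log line in this report, the regular expression and variable names are illustrative and not part of Trafodion, and the delay calculation assumes the log timestamp and the recorded stop time come from the same clock.

```python
import re
from datetime import datetime

# The MON error line quoted above, joined back into one line
# (the mail wrapped it across several lines).
LOG_LINE = (
    "2017-06-16 13:33:46,276, ERROR, MON, Node Number: 0,, PIN: 5017 , "
    "Process Name: $MONITOR,,, TID: 5429, Message ID: 101371801, "
    "[CZClient::IsZNodeExpired], zoo_exists() for "
    "/trafodion/instance/cluster/esggy-clu-n010.esgyn.cn failed "
    "with error ZCONNECTIONLOSS"
)

# Assumed layout, inferred from this one line only.
PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}), "
    r"(?P<level>\w+), (?P<component>\w+), .*"
    r"\[(?P<function>[^\]]+)\], "
    r"zoo_exists\(\) for (?P<znode>\S+) failed with error (?P<error>\w+)$"
)

match = PATTERN.match(LOG_LINE)
fields = match.groupdict()

# Compare the log timestamp with the recorded ZooKeeper stop time.
logged_at = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S,%f")
stopped_at = datetime.strptime("20170616133344", "%Y%m%d%H%M%S")
delay_seconds = (logged_at - stopped_at).total_seconds()

print(fields["function"], fields["znode"], fields["error"])
print(f"error logged {delay_seconds:.3f}s after ZooKeeper was stopped")
```

With the values quoted in this report, the error was logged a little over two seconds after ZooKeeper was stopped.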

Location of logs:
esggy-clu-n010: 
/data4/jarek/ha.interactive/trafodion_and_cluster_logs/cluster_logs.20170616134816.tar.gz
 and trafodion_logs.20170616134816.tar.gz
The logs exceed the attachment size limit, so I cannot upload them to this 
JIRA issue.

How many ZooKeeper quorum servers are in the cluster? 3 in total.
  <property>
    <name>dcs.zookeeper.quorum</name>
    <value>esggy-clu-n010.esgyn.cn,esggy-clu-n011.esgyn.cn,esggy-clu-n012.esgyn.cn</value>
  </property>
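
A ZooKeeper ensemble keeps serving requests as long as a strict majority of its servers is up. The sketch below applies that rule to the quorum list above. It assumes that the hosts in dcs.zookeeper.quorum are exactly the members of the ZooKeeper ensemble, which this report does not confirm; the function names are illustrative.

```python
# Value of dcs.zookeeper.quorum, copied from the property above.
QUORUM_VALUE = (
    "esggy-clu-n010.esgyn.cn,esggy-clu-n011.esgyn.cn,esggy-clu-n012.esgyn.cn"
)


def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, servers_down: int) -> bool:
    """True if the servers still running form a majority of the ensemble."""
    return ensemble_size - servers_down >= majority(ensemble_size)


servers = [host.strip() for host in QUORUM_VALUE.split(",") if host.strip()]

# Assumption: one server (esggy-clu-n010) is stopped, as in Step 2.
print(len(servers), majority(len(servers)), has_quorum(len(servers), 1))
```

Under that assumption, stopping one of the 3 servers leaves 2 running, which is still a majority, so the ensemble itself would be expected to stay available during this test.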

How to access the cluster?
1.     Log in to 10.10.10.8 from a US machine: trafodion/traf123
2.     Log in to 10.10.23.19 from 10.10.10.8: trafodion/traf123




  was:
Description: Instance will be down when the zookeeper on name node has been down
Test Steps:
Step 1. Start OE and 4 long queries with trafci on the first node esggy-clu-n010
Step 2. Wait several minutes and stop zookeeper on name node of node 
esggy-clu-n010  in Cloudera Manager page.
Step 3. With trafci, run a basic query and 4 long queries again.

In the above Step 3, we will see the whole instance as down after a while. For 
this test scenario, I tried it several times, always found instance as down.

Timestamp:
Test Start Time: 20170616132939
Test End  Time: 20170616134350
Stop zookeeper on name node of node esggy-clu-n010: 20170616133344

Check logs:
1) Each node displays the following error:
2017-06-16 13:33:46,276, ERROR, MON, Node Number: 0,, PIN: 5017 , Process Name: 
$MONITOR,,, TID: 5429, Message ID: 101371801, [CZClient::IsZNodeExpired], 
zoo_exists() for /trafodion/instance/cluster/esggy-clu-n010.esgyn.cn failed 
with error ZCONNECTIONLOSS
2) Zookeeper displays:
ls /trafodion/instance/cluster
[]
So, It seems zclient has been lost on each node.

Location of logs:
esggy-clu-n010: 
/data4/jarek/ha.interactive/trafodion_and_cluster_logs/cluster_logs.20170616134816.tar.gz
 and trafodion_logs.20170616134816.tar.gz
By the way, because the size of the logs is out of the limited size, so i 
cannot upload it as the attachment in this JIRA ID.

How to access the cluster?
1.     Login 10.10.10.8 from US machine: trafodion/traf123
2.     Login 10.10.23.19 from 10.10.10.8: trafodion/traf123





> Instance will be down when the zookeeper on name node has been down
> -------------------------------------------------------------------
>
>                 Key: TRAFODION-2664
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2664
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation
>    Affects Versions: 2.2-incubating
>         Environment: Test Environment:
> CDH5.4.8: 10.10.23.19:7180, total 6 nodes.
> HDFS-HA and DCS-HA: enabled
> OS: CentOS 6.8, physical machine.
> SW Build: R2.2.3 (EsgynDB_Enterprise Release 2.2.3 (Build release [sbroeder], 
> branch 1ce8d39-xdc_nari, date 11Jun17)
>            Reporter: Jarek
>            Priority: Critical
>              Labels: build
>             Fix For: 2.2-incubating
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
