The problem was figured out by one of my co-workers: someone put a zoo.cfg under ./hbase/conf, which messed up the quorum lookup.
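In case anyone else hits this: a quick way to see which ZooKeeper quorum HBase actually resolves is a small standalone class like the sketch below (the PrintQuorum name is just for illustration, and it assumes the 0.94 ZKConfig API). In 0.94, ZKConfig reads a zoo.cfg found on the classpath, so with a stray zoo.cfg in the conf directory the second line should disagree with hbase.zookeeper.quorum from hbase-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.zookeeper.ZKConfig;

    // Quick diagnostic, not part of HBase: print the quorum this process would connect to.
    public class PrintQuorum {
      public static void main(String[] args) {
        // Loads hbase-default.xml and hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        System.out.println("hbase-site quorum = " + conf.get("hbase.zookeeper.quorum"));
        // The quorum string the ZooKeeper watcher ends up using; in 0.94 this
        // path prefers a zoo.cfg on the classpath over hbase-site.xml.
        System.out.println("resolved quorum   = " + ZKConfig.getZKQuorumServersString(conf));
      }
    }

Compile and run it against the server classpath on each node, e.g.
javac -cp "$(hbase classpath)" PrintQuorum.java && java -cp "$(hbase classpath):." PrintQuorum
If the two lines differ, something other than hbase-site.xml is supplying the quorum.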
many thanks for your help

On Fri, Nov 1, 2013 at 12:48 PM, Demai Ni <[email protected]> wrote:

> I injected more debug code into ReplicationPeer:
>
>   public ReplicationPeer(Configuration conf, String key,
>       String id) throws IOException {
>     this.conf = conf;
>     this.clusterKey = key;
>     this.id = id;
>     this.reloadZkWatcher();
>
>     LOG.info("Demai @ReplicationPeer : clusterkey=" + key + ",id=" + id);
>     LOG.info("Demai @ReplicationPeer : this.zkw.quom =" +
>         this.zkw.getQuorum());   // Quorum is incorrect
>     LOG.info("Demai @ReplicationPeer : this.zkw=" + this.zkw.toString());
>   }
>
> and on the problematic cluster, the ReplicationPeer.zkw.quorum is wrong:
>
> 2013-11-01 12:40:33,351 INFO org.apache.hadoop.hbase.replication.ReplicationPeer: Demai @ReplicationPeer : clusterkey=6,id=hdtest014.svl.ibm.com:2181:/hbase
> 2013-11-01 12:40:33,351 INFO org.apache.hadoop.hbase.replication.ReplicationPeer: Demai @ReplicationPeer : this.zkw.quom =*bdvm134.svl.ibm.com:2181*
> 2013-11-01 12:40:33,351 INFO org.apache.hadoop.hbase.replication.ReplicationPeer: Demai @ReplicationPeer : this.zkw=connection to cluster: hdtest014.svl.ibm.com:2181:/hbase
>
> On Fri, Nov 1, 2013 at 11:12 AM, Demai Ni <[email protected]> wrote:
>
>> Himanshu and Nick,
>>
>> many thanks for your help. I don't have all the answers to Nick's
>> questions, since the deployment was built by another team and combined
>> with a lot of other components (zookeeper, hadoop, hbase, hive, oozie, etc.).
>>
>> I followed Himanshu's suggestion and checked the hbase.id on the two
>> different problematic clusters; they are different, so that seems normal
>> to me. About the deployment: I did a clean install (well, at least that
>> was my intention) and did not re-use existing znodes. The installation
>> stops everything (zookeeper, hadoop, hbase, etc.), removes all the files
>> and data, and then installs everything, so there should be nothing left over.
>>
>> Let me describe the current setup and my investigation so far. Rows can
>> be replicated from the correct cluster to the problematic cluster, but
>> can't be replicated from the problematic one, EVEN when both have the
>> same hbase.jar.
>>
>> *Problematic Cluster:*
>>   name = bdvm134
>>   /hbase/hbase.id = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>>   > list_peers   (I put two there just for debug purposes)
>>   PEER_ID  CLUSTER_KEY                        STATE
>>   6        hdtest014.svl.ibm.com:2181:/hbase  ENABLED
>>   7        hdtest014.svl.ibm.com:2181:/hbase  ENABLED
>>
>> *Correct Cluster:*
>>   name = hdtest014
>>   /hbase/hbase.id = ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42
>>   > list_peers
>>   PEER_ID  CLUSTER_KEY                        STATE
>>   1        bdvm134.svl.ibm.com:2181:/hbase    ENABLED
>>
>> I injected some debugging code into ReplicationSource.run():
>>
>>   public void run() {
>>     ....
>> >> LOG.info("Replicating "+clusterId + " -> " + peerClusterId); >> >> Map<String, ReplicationPeer> peerList = zkHelper.getPeerClusters(); >> >> for (Map.Entry<String, ReplicationPeer> peer : peerList.entrySet()) { >> LOG.info("Demai ---------------begin"); >> String peerId_A = peer.getKey(); >> ReplicationPeer rPeer = peer.getValue(); >> try { >> LOG.info("clusterUUId = " + zkHelper.getUUIDForCluster( >> zkHelper.getZookeeperWatcher())); >> LOG.info("peerUUID = " + zkHelper.getPeerUUID(peerId_A)); >> } catch (KeeperException e) { >> LOG.info("exception = " + e); >> } >> >> LOG.info("peerID = " + peerId_A); >> LOG.info("peer Value=" + rPeer.toString()); >> >> List<ServerName> sList = zkHelper.getSlavesAddresses(peerId_A); >> for (ServerName sName : sList) { >> LOG.info("sName = " + sName.getHostname()); *// value incorrect >> on problematic cluster* >> } >> LOG.info("Peer Cluster=" + rPeer.getClusterKey() + ",Peer ID = " + >> rPeer.getId()); >> LOG.info("Demai ---------------end"); >> } >> ... >> } >> >> >> >> on bdvm134- regionserver: >> 2013-11-01 10:20:44,757 DEBUG >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening >> log for replication bdvm134.svl.ibm.com%2C60020%2C1383324585548.1383324589592 >> at 3073 >> 2013-11-01 10:20:44,761 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: >> Replicating b13a0e3a-2bec-4e13-8b1d-043aa1a66443 -> >> b13a0e3a-2bec-4e13-8b1d-043aa1a66443 >> 2013-11-01 10:20:44,761 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai >> ---------------begin >> 2013-11-01 10:20:44,773 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: >> clusterUUId = b13a0e3a-2bec-4e13-8b1d-043aa1a66443 >> 2013-11-01 10:20:44,777 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: >> peerUUID = b13a0e3a-2bec-4e13-8b1d-043aa1a66443 >> 2013-11-01 10:20:44,777 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerID >> = 6 >> 2013-11-01 10:20:44,777 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peer >> Value=org.apache.hadoop.hbase.replication.ReplicationPeer@33bb33bb >> 2013-11-01 10:20:44,779 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: sName = >> bdvm134.svl.ibm.com >> 2013-11-01 10:20:44,779 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Peer >> Cluster=6,Peer ID = hdtest014.svl.ibm.com:2181:/hbase >> 2013-11-01 10:20:44,779 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai >> ---------------end >> 2013-11-01 10:20:44,779 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai >> ---------------begin >> 2013-11-01 10:20:44,786 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: >> clusterUUId = b13a0e3a-2bec-4e13-8b1d-043aa1a66443 >> 2013-11-01 10:20:44,790 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: >> peerUUID = b13a0e3a-2bec-4e13-8b1d-043aa1a66443 >> 2013-11-01 10:20:44,790 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerID >> = 7 >> 2013-11-01 10:20:44,790 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peer >> Value=org.apache.hadoop.hbase.replication.ReplicationPeer@710071 >> 2013-11-01 10:20:44,792 INFO >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: sName = >> *bdvm134.svl.ibm.com* >> 2013-11-01 10:20:44,792 INFO >> 
>> 2013-11-01 10:20:44,792 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Peer Cluster=7,Peer ID = *hdtest014.svl.ibm.com*:2181:/hbase
>> 2013-11-01 10:20:44,792 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai ---------------end
>> 2013-11-01 10:20:44,794 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication bdvm134.svl.ibm.com%2C60020%2C1383324585548.1383324589592 at 3073
>>
>> on hdtest014 regionserver:
>>
>> 2013-11-01 10:25:01,260 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42 -> b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> 2013-11-01 10:25:01,260 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai ---------------begin
>> 2013-11-01 10:25:01,263 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: clusterUUId = ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42
>> 2013-11-01 10:25:01,279 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerUUID = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> 2013-11-01 10:25:01,279 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerID = 1
>> 2013-11-01 10:25:01,279 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peer Value=org.apache.hadoop.hbase.replication.ReplicationPeer@70897089
>> 2013-11-01 10:25:01,281 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: sName = *bdvm134.svl.ibm.com*
>> 2013-11-01 10:25:01,281 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Peer Cluster=1,Peer ID = *bdvm134.svl.ibm.com*:2181:/hbase
>> 2013-11-01 10:25:01,281 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai ---------------end
>>
>> On Fri, Nov 1, 2013 at 10:07 AM, Nick Dimiduk <[email protected]> wrote:
>>
>>> Are you re-deploying over an existing installation? Is it your intention
>>> to preserve data between deployments, or are you running in a testing
>>> environment? Are you clearing ZK as part of deploying a fresh cluster, or
>>> are you re-using existing znodes? How did you configure replication in
>>> the shell? Can you provide those commands? I'd request debug logs from
>>> o.a.h.h.regionserver.Replication, but I don't see much logging in there
>>> anyway.
>>>
>>> Basically, can you repro this in a fresh deployment? As Himanshu points
>>> out, I'm suspicious of stale configuration hanging around.
>>>
>>>
>>> On Thu, Oct 31, 2013 at 8:02 PM, Demai Ni <[email protected]> wrote:
>>>
>>> > Nick,
>>> >
>>> > thanks for looking into this problem. I attached the hbase-site.xml in
>>> > this email. I just want to point out that I had to tear down the
>>> > cluster from which I posted the original log, so the hbase-site.xml is
>>> > from another (single-node) cluster with the same problem.
>>> >
>>> > BTW, I did some investigation this afternoon and don't think this is a
>>> > problem within the hbase code. (Background: I am working within a
>>> > software team, and quite a few engineers change hbase, hadoop, and
>>> > other code every day.) I tried out several different installations and
>>> > found that a week-old build with today's hbase works just fine, but
>>> > today's build with last week's hbase doesn't. Our build includes
>>> > hadoop 2, which could be introducing something problematic.
>>> >
>>> > I am wondering how hbase generates the UUID? Maybe that is something I
>>> > should look into. Thanks
>>> >
>>> > Demai
>>> >
>>> >
>>> > On Thu, Oct 31, 2013 at 6:20 PM, Nick Dimiduk <[email protected]> wrote:
>>> >
>>> >> Can you post your replication settings from hbase-site.xml?
>>> >>
>>> >> On Thursday, October 31, 2013, Demai Ni wrote:
>>> >>
>>> >> > hi, folks,
>>> >> >
>>> >> > I got a strange thing happening on my cluster (hbase 0.94.9)
>>> >> > recently. I am setting up a new cluster for replication, and didn't
>>> >> > see the data being replicated over to the peer. Then I found the
>>> >> > following in the log of the regionserver of the Master:
>>> >> >
>>> >> > 2013-10-31 13:33:03,293 INFO org.apache.hadoop.hbase.metrics: new MBeanInfo
>>> >> > 2013-10-31 13:33:03,300 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 1 rs from peer cluster # 3
>>> >> > 2013-10-31 13:33:03,300 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer hdtest018.svl.ibm.com,60020,1383251582072
>>> >> > 2013-10-31 13:33:03,302 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating *b520de1d-3a18-4aec-bd45-de000e81417d* -> *b520de1d-3a18-4aec-bd45-de000e81417d*
>>> >> >
>>> >> > the log is from ReplicationSource:
>>> >> > *LOG.info("Replicating " + clusterId + " -> " + peerClusterId);*
>>> >> >
>>> >> > It seems the problematic cluster is replicating to itself.
>>> >> > Any suggestion about how to look into this problem? Many thanks.
>>> >> >
>>> >> > BTW, I can replicate from another cluster to this problematic one.
>>> >> >
>>> >> > Demai
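PS: to close the loop on my earlier question about how HBase generates the UUID: as far as I
understand 0.94 (my reading, not quoted from the code), the cluster id is just a random UUID that
the master creates on its first startup, persists under ${hbase.rootdir}/hbase.id, and publishes
in the /hbase/hbaseid znode; replication then compares the local UUID against the one read from
the peer's ZooKeeper. Conceptually it amounts to something like this sketch (not the actual HBase
code):

    import java.util.UUID;

    public class ClusterIdSketch {
      public static void main(String[] args) {
        // generated once per cluster, then persisted in hbase.id and the hbaseid znode
        String clusterId = UUID.randomUUID().toString();
        System.out.println("cluster id, e.g. " + clusterId);
      }
    }

So two clusters should never come up with the same id on their own; in my logs they only looked
identical because the stray zoo.cfg made the "peer" lookup go to the local ZooKeeper, which
returned our own hbaseid.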
