One of my co-workers figured out the problem. Someone had put a zoo.cfg
under ./hbase/conf, which messed up the quorum lookup.

many thanks for your help
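
For anyone who hits the same symptom: as far as I understand, HBase of this vintage picks up a zoo.cfg found on its classpath and prefers it over the ZooKeeper settings it would otherwise use, which would explain why the peer's watcher ended up with the local quorum. Below is a minimal sketch of a check for a stray zoo.cfg. The class name ZooCfgCheck and the helper findZooCfg are made up for illustration; the only real API used is the JDK's ClassLoader.getResource.

```java
import java.net.URL;

public class ZooCfgCheck {
    /**
     * Returns the location of a zoo.cfg visible to the given class loader,
     * or null if none is found. (Hypothetical helper, not part of HBase.)
     */
    static URL findZooCfg(ClassLoader loader) {
        return loader.getResource("zoo.cfg");
    }

    public static void main(String[] args) {
        URL hit = findZooCfg(ZooCfgCheck.class.getClassLoader());
        if (hit == null) {
            System.out.println("no zoo.cfg on classpath");
        } else {
            // A hit under the hbase conf directory is the situation described above.
            System.out.println("zoo.cfg found at " + hit);
        }
    }
}
```

The result only means something if the class is run with the same classpath as the region server (including ./hbase/conf); with any other classpath it says nothing about what HBase sees.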


On Fri, Nov 1, 2013 at 12:48 PM, Demai Ni <[email protected]> wrote:

> I injected more debug code into ReplicationPeer.
>
>  public ReplicationPeer(Configuration conf, String key,
>       String id) throws IOException {
>     this.conf = conf;
>     this.clusterKey = key;
>     this.id = id;
>     this.reloadZkWatcher();
>
>     LOG.info("Demai @ReplicationPeer : clusterkey=" + key + ",id=" + id);
>     LOG.info("Demai @ReplicationPeer : this.zkw.quom =" +
>         this.zkw.getQuorum()); // Quorum is incorrect
>     LOG.info("Demai @ReplicationPeer : this.zkw=" + this.zkw.toString());
>   }
>
>
> and on the problematic cluster, the ReplicationPeer.zkw.quorum is wrong
>
> 2013-11-01 12:40:33,351 INFO
> org.apache.hadoop.hbase.replication.ReplicationPeer: Demai @ReplicationPeer
> : clusterkey=6,id=hdtest014.svl.ibm.com:2181:/hbase
> 2013-11-01 12:40:33,351 INFO
> org.apache.hadoop.hbase.replication.ReplicationPeer: Demai @ReplicationPeer
> : this.zkw.quom =*bdvm134.svl.ibm.com:2181*
> 2013-11-01 12:40:33,351 INFO
> org.apache.hadoop.hbase.replication.ReplicationPeer: Demai @ReplicationPeer
> : this.zkw=connection to cluster: hdtest014.svl.ibm.com:2181:/hbase
>
>
>
> On Fri, Nov 1, 2013 at 11:12 AM, Demai Ni <[email protected]> wrote:
>
>> Himanshu and Nick,
>>
>> Many thanks for your help. I don't have all the answers to Nick's
>> questions, since the deployment is built by another team and combined with
>> a lot of other components like zookeeper, hadoop, hbase, hive, oozie, etc.
>>
>> I followed Himanshu's suggestion and checked the hbase.id on two
>> different problematic clusters; they are different, so that seems normal to
>> me. About the deployment: I did a clean install (well, at least that was my
>> intention) and am not re-using existing znodes. The installation stops
>> everything (zookeeper, hadoop, hbase, etc.), removes all the files and data,
>> and then installs everything, so there should be nothing left over.
>>
>> Let me describe the current setup and my investigation so far. Rows can be
>> replicated from the correct cluster to the problematic cluster, but can't be
>> replicated from the problematic one, EVEN though both have the same hbase.jar.
>>
>> *Problematic Cluster:*
>> name = bdvm134
>> /hbase/hbase.id =  $b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> > list_peers  (I put two there just for debugging purposes)
>>  PEER_ID CLUSTER_KEY STATE
>>  6 hdtest014.svl.ibm.com:2181:/hbase ENABLED
>>  7 hdtest014.svl.ibm.com:2181:/hbase ENABLED
>>
>>
>> *Correct Cluster:*
>> name = hdtest014
>> /hbase/hbase.id = ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42
>> > list_peers
>>  PEER_ID CLUSTER_KEY STATE
>>  1 bdvm134.svl.ibm.com:2181:/hbase ENABLED
>>
>>
>> I injected some debugging code into ReplicationSource.run()
>> public void run() {
>>   ....
>>
>>     LOG.info("Replicating "+clusterId + " -> " + peerClusterId);
>>
>>     Map<String, ReplicationPeer> peerList = zkHelper.getPeerClusters();
>>
>>     for (Map.Entry<String, ReplicationPeer> peer : peerList.entrySet()) {
>>       LOG.info("Demai ---------------begin");
>>       String peerId_A = peer.getKey();
>>       ReplicationPeer rPeer = peer.getValue();
>>       try {
>>         LOG.info("clusterUUId = " + zkHelper.getUUIDForCluster(
>>             zkHelper.getZookeeperWatcher()));
>>         LOG.info("peerUUID = " + zkHelper.getPeerUUID(peerId_A));
>>       } catch (KeeperException e) {
>>         LOG.info("exception = " + e);
>>       }
>>
>>       LOG.info("peerID = " + peerId_A);
>>       LOG.info("peer Value=" + rPeer.toString());
>>
>>       List<ServerName> sList = zkHelper.getSlavesAddresses(peerId_A);
>>       for (ServerName sName : sList) {
>>         // value incorrect on problematic cluster
>>         LOG.info("sName = " + sName.getHostname());
>>       }
>>       LOG.info("Peer Cluster=" + rPeer.getClusterKey() + ",Peer ID = " +
>>           rPeer.getId());
>>       LOG.info("Demai ---------------end");
>>     }
>> ...
>> }
>>
>>
>>
>> on bdvm134 regionserver:
>> 2013-11-01 10:20:44,757 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication bdvm134.svl.ibm.com%2C60020%2C1383324585548.1383324589592
>> at 3073
>> 2013-11-01 10:20:44,761 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> Replicating b13a0e3a-2bec-4e13-8b1d-043aa1a66443 ->
>> b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> 2013-11-01 10:20:44,761 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
>> ---------------begin
>> 2013-11-01 10:20:44,773 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> clusterUUId = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> 2013-11-01 10:20:44,777 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> peerUUID = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> 2013-11-01 10:20:44,777 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerID
>> = 6
>> 2013-11-01 10:20:44,777 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peer
>> Value=org.apache.hadoop.hbase.replication.ReplicationPeer@33bb33bb
>> 2013-11-01 10:20:44,779 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: sName =
>> bdvm134.svl.ibm.com
>> 2013-11-01 10:20:44,779 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Peer
>> Cluster=6,Peer ID = hdtest014.svl.ibm.com:2181:/hbase
>> 2013-11-01 10:20:44,779 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
>> ---------------end
>> 2013-11-01 10:20:44,779 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
>> ---------------begin
>> 2013-11-01 10:20:44,786 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> clusterUUId = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> 2013-11-01 10:20:44,790 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> peerUUID = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> 2013-11-01 10:20:44,790 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerID
>> = 7
>> 2013-11-01 10:20:44,790 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peer
>> Value=org.apache.hadoop.hbase.replication.ReplicationPeer@710071
>> 2013-11-01 10:20:44,792 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: sName =
>> *bdvm134.svl.ibm.com*
>> 2013-11-01 10:20:44,792 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Peer
>> Cluster=7,Peer ID = *hdtest014.svl.ibm.com*:2181:/hbase
>> 2013-11-01 10:20:44,792 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
>> ---------------end
>> 2013-11-01 10:20:44,794 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication bdvm134.svl.ibm.com%2C60020%2C1383324585548.1383324589592
>> at 3073
>>
>>
>> on hdtest014 regionserver:
>> 2013-11-01 10:25:01,260 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> Replicating ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42 ->
>> b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> 2013-11-01 10:25:01,260 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
>> ---------------begin
>> 2013-11-01 10:25:01,263 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> clusterUUId = ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42
>> 2013-11-01 10:25:01,279 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> peerUUID = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
>> 2013-11-01 10:25:01,279 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerID
>> = 1
>> 2013-11-01 10:25:01,279 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peer
>> Value=org.apache.hadoop.hbase.replication.ReplicationPeer@70897089
>> 2013-11-01 10:25:01,281 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: sName =
>> *bdvm134.svl.ibm.com*
>> 2013-11-01 10:25:01,281 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Peer
>> Cluster=1,Peer ID = *bdvm134.svl.ibm.com*:2181:/hbase
>> 2013-11-01 10:25:01,281 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
>> ---------------end
>>
>>
>>
>>
>> On Fri, Nov 1, 2013 at 10:07 AM, Nick Dimiduk <[email protected]> wrote:
>>
>>> Are you re-deploying over an existing installation? Is it your intention to
>>> preserve data between deployments or are you running in a testing
>>> environment? Are you clearing ZK as part of deploying a fresh cluster or
>>> are you re-using existing znodes? How did you configure replication in the
>>> shell? Can you provide those commands? I'd request debug logs from
>>> o.a.h.h.regionserver.Replication but I don't see much logging in there
>>> anyway.
>>>
>>> Basically, can you repro this in a fresh deployment? As Himanshu points
>>> out, I'm suspicious of stale configuration hanging around.
>>>
>>>
>>> On Thu, Oct 31, 2013 at 8:02 PM, Demai Ni <[email protected]> wrote:
>>>
>>> > Nick,
>>> >
>>> > Thanks for looking into this problem. I attached the hbase-site.xml to
>>> > this email. I'd just like to point out that I had to tear down the
>>> > cluster that produced the original log, so the hbase-site.xml is from
>>> > another (single-node) cluster with the same problem.
>>> >
>>> > BTW, I did some investigation this afternoon and don't think this is a
>>> > problem within the hbase code. (Background: I am working within a software
>>> > team, and quite a few engineers change hbase, hadoop, and other code
>>> > every day.) I tried out several different installations, and found that a
>>> > week-old build with today's hbase build works just fine, but today's build
>>> > with last week's hbase doesn't. Our build includes hadoop 2, which could
>>> > introduce something problematic.
>>> >
>>> > I'm wondering how hbase generates the UUID. Maybe that is something I
>>> > should look into? Thanks
>>> >
>>> > Demai
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Oct 31, 2013 at 6:20 PM, Nick Dimiduk <[email protected]> wrote:
>>> >
>>> >> Can you post your replication settings from hbase-site.xml?
>>> >>
>>> >> On Thursday, October 31, 2013, Demai Ni wrote:
>>> >>
>>> >> > hi, folks,
>>> >> >
>>> >> > I got a strange thing happening on my cluster (hbase 0.94.9)
>>> >> > recently. I am setting up a new cluster for replication, and didn't
>>> >> > see the data being replicated over to the peer. Then I found the
>>> >> > following in the log of the regionserver of the Master:
>>> >> >
>>> >> > 2013-10-31 13:33:03,293 INFO org.apache.hadoop.hbase.metrics: new MBeanInfo
>>> >> > 2013-10-31 13:33:03,300 INFO
>>> >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>>> >> > Getting 1 rs from peer cluster # 3
>>> >> > 2013-10-31 13:33:03,300 INFO
>>> >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>>> >> > Choosing peer hdtest018.svl.ibm.com,60020,1383251582072
>>> >> > 2013-10-31 13:33:03,302 INFO
>>> >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>>> >> > Replicating *b520de1d-3a18-4aec-bd45-de000e81417d* ->
>>> >> > *b520de1d-3a18-4aec-bd45-de000e81417d*
>>> >> >
>>> >> > the log is from ReplicationSource:
>>> >> > *LOG.info("Replicating "+clusterId + " -> " + peerClusterId);*
>>> >> >
>>> >> > It seems the problematic cluster is replicating to itself.
>>> >> > Any suggestion about how to look into this problem? Many thanks
>>> >> >
>>> >> > BTW, I can replicate from another cluster to this problematic one.
>>> >> >
>>> >> > Demai
>>> >> >
>>> >>
>>> >
>>> >
>>>
>>
>>
>
