I injected more debug code into ReplicationPeer.

  public ReplicationPeer(Configuration conf, String key,
      String id) throws IOException {
    this.conf = conf;
    this.clusterKey = key;
    this.id = id;
    this.reloadZkWatcher();

    LOG.info("Demai @ReplicationPeer : clusterkey=" + key + ",id=" + id);
    // The quorum logged here is incorrect on the problematic cluster.
    LOG.info("Demai @ReplicationPeer : this.zkw.quom =" +
        this.zkw.getQuorum());
    LOG.info("Demai @ReplicationPeer : this.zkw=" + this.zkw.toString());
  }


On the problematic cluster, ReplicationPeer.zkw reports the wrong quorum
(bdvm134 instead of hdtest014):

2013-11-01 12:40:33,351 INFO org.apache.hadoop.hbase.replication.ReplicationPeer: Demai @ReplicationPeer : clusterkey=6,id=hdtest014.svl.ibm.com:2181:/hbase
2013-11-01 12:40:33,351 INFO org.apache.hadoop.hbase.replication.ReplicationPeer: Demai @ReplicationPeer : this.zkw.quom =*bdvm134.svl.ibm.com:2181*
2013-11-01 12:40:33,351 INFO org.apache.hadoop.hbase.replication.ReplicationPeer: Demai @ReplicationPeer : this.zkw=connection to cluster: hdtest014.svl.ibm.com:2181:/hbase
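
To make this kind of mismatch easier to spot, here is a minimal standalone sketch (plain Java, no HBase classes) that splits a cluster key of the form quorum:clientPort:znodeParent and compares the result with the quorum a watcher reports. The class and method names (ClusterKeyCheck, quorumOf, matches) are made up for illustration and are not HBase APIs. That the watcher reports its quorum as host:port, comma-separated, is an assumption based only on the log output above.

```java
class ClusterKeyCheck {

    /**
     * Builds "host1:port,host2:port" from a cluster key such as
     * "host1,host2:port:/znode". The output format is an assumption
     * taken from the quorum string seen in the log above.
     */
    static String quorumOf(String clusterKey) {
        String[] parts = clusterKey.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected quorum:port:znode, got " + clusterKey);
        }
        String port = parts[1];
        StringBuilder sb = new StringBuilder();
        for (String host : parts[0].split(",")) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(host.trim()).append(':').append(port);
        }
        return sb.toString();
    }

    /** True when the watcher's quorum agrees with the peer's cluster key. */
    static boolean matches(String clusterKey, String watcherQuorum) {
        return quorumOf(clusterKey).equals(watcherQuorum);
    }

    public static void main(String[] args) {
        // Values copied from the log lines above.
        String key = "hdtest014.svl.ibm.com:2181:/hbase";
        String reported = "bdvm134.svl.ibm.com:2181";
        System.out.println("expected quorum = " + quorumOf(key));
        System.out.println("matches watcher = " + matches(key, reported));
    }
}
```

With the values from the log, the expected quorum is hdtest014.svl.ibm.com:2181 while the watcher reports bdvm134, so the check returns false.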



On Fri, Nov 1, 2013 at 11:12 AM, Demai Ni <[email protected]> wrote:

> Himanshu and Nick,
>
> Many thanks for your help. I don't have all the answers to Nick's
> questions, since the deployment is built by another team and combined with
> a lot of other components like zookeeper, hadoop, hbase, hive, oozie, etc.
>
> I followed Himanshu's suggestion and checked the hbase.id on two
> different problematic clusters; the ids differ, so that looks normal to me.
> About the deployment: I did a clean install (at least that was my
> intention) and am not re-using existing znodes. The installation stops
> everything (zookeeper, hadoop, hbase, etc.), removes all files and data,
> and then installs everything, so nothing should be left over.
>
> Let me describe the current setup and my investigation so far. Rows can be
> replicated from the correct cluster to the problematic cluster, but not
> from the problematic one, even though both run the same hbase.jar.
>
> *Problematic Cluster:*
> name = bdvm134
> /hbase/hbase.id =  $b13a0e3a-2bec-4e13-8b1d-043aa1a66443
> > list_peers  (I added two peers just for debugging)
>  PEER_ID CLUSTER_KEY STATE
>  6 hdtest014.svl.ibm.com:2181:/hbase ENABLED
>  7 hdtest014.svl.ibm.com:2181:/hbase ENABLED
>
>
> *Correct Cluster:*
> name = hdtest014
> /hbase/hbase.id = ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42
> > list_peers
>  PEER_ID CLUSTER_KEY STATE
>  1 bdvm134.svl.ibm.com:2181:/hbase ENABLED
>
>
> I injected some debugging code into ReplicationSource.run()
> public void run() {
>   ....
>
>     LOG.info("Replicating "+clusterId + " -> " + peerClusterId);
>
>     Map<String, ReplicationPeer> peerList = zkHelper.getPeerClusters();
>
>     for (Map.Entry<String, ReplicationPeer> peer : peerList.entrySet()) {
>       LOG.info("Demai ---------------begin");
>       String peerId_A = peer.getKey();
>       ReplicationPeer rPeer = peer.getValue();
>       try {
>         LOG.info("clusterUUId = " + zkHelper.getUUIDForCluster(
> zkHelper.getZookeeperWatcher()));
>         LOG.info("peerUUID = " + zkHelper.getPeerUUID(peerId_A));
>       } catch (KeeperException e) {
>         LOG.info("exception = " + e);
>       }
>
>       LOG.info("peerID = " + peerId_A);
>       LOG.info("peer Value=" + rPeer.toString());
>
>       List<ServerName> sList = zkHelper.getSlavesAddresses(peerId_A);
>       for (ServerName sName : sList) {
>         // value is incorrect on the problematic cluster
>         LOG.info("sName = " + sName.getHostname());
>       }
>       LOG.info("Peer Cluster=" + rPeer.getClusterKey() + ",Peer ID = " +
> rPeer.getId());
>       LOG.info("Demai ---------------end");
>     }
> ...
> }
>
>
>
> On the bdvm134 regionserver:
> 2013-11-01 10:20:44,757 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
> log for replication bdvm134.svl.ibm.com%2C60020%2C1383324585548.1383324589592
> at 3073
> 2013-11-01 10:20:44,761 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Replicating b13a0e3a-2bec-4e13-8b1d-043aa1a66443 ->
> b13a0e3a-2bec-4e13-8b1d-043aa1a66443
> 2013-11-01 10:20:44,761 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
> ---------------begin
> 2013-11-01 10:20:44,773 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> clusterUUId = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
> 2013-11-01 10:20:44,777 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> peerUUID = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
> 2013-11-01 10:20:44,777 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerID
> = 6
> 2013-11-01 10:20:44,777 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peer
> Value=org.apache.hadoop.hbase.replication.ReplicationPeer@33bb33bb
> 2013-11-01 10:20:44,779 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: sName =
> bdvm134.svl.ibm.com
> 2013-11-01 10:20:44,779 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Peer
> Cluster=6,Peer ID = hdtest014.svl.ibm.com:2181:/hbase
> 2013-11-01 10:20:44,779 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
> ---------------end
> 2013-11-01 10:20:44,779 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
> ---------------begin
> 2013-11-01 10:20:44,786 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> clusterUUId = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
> 2013-11-01 10:20:44,790 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> peerUUID = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
> 2013-11-01 10:20:44,790 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerID
> = 7
> 2013-11-01 10:20:44,790 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peer
> Value=org.apache.hadoop.hbase.replication.ReplicationPeer@710071
> 2013-11-01 10:20:44,792 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: sName =
> *bdvm134.svl.ibm.com*
> 2013-11-01 10:20:44,792 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Peer
> Cluster=7,Peer ID = *hdtest014.svl.ibm.com*:2181:/hbase
> 2013-11-01 10:20:44,792 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
> ---------------end
> 2013-11-01 10:20:44,794 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
> log for replication bdvm134.svl.ibm.com%2C60020%2C1383324585548.1383324589592
> at 3073
>
>
> On the hdtest014 regionserver:
> 2013-11-01 10:25:01,260 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Replicating ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42 ->
> b13a0e3a-2bec-4e13-8b1d-043aa1a66443
> 2013-11-01 10:25:01,260 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
> ---------------begin
> 2013-11-01 10:25:01,263 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> clusterUUId = ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42
> 2013-11-01 10:25:01,279 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> peerUUID = b13a0e3a-2bec-4e13-8b1d-043aa1a66443
> 2013-11-01 10:25:01,279 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peerID
> = 1
> 2013-11-01 10:25:01,279 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: peer
> Value=org.apache.hadoop.hbase.replication.ReplicationPeer@70897089
> 2013-11-01 10:25:01,281 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: sName =
> *bdvm134.svl.ibm.com*
> 2013-11-01 10:25:01,281 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Peer
> Cluster=1,Peer ID = *bdvm134.svl.ibm.com*:2181:/hbase
> 2013-11-01 10:25:01,281 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Demai
> ---------------end
>
>
>
>
> On Fri, Nov 1, 2013 at 10:07 AM, Nick Dimiduk <[email protected]> wrote:
>
>> Are you re-deploying over an existing installation? Is it your intention
>> to preserve data between deployments or are you running in a testing
>> environment? Are you clearing ZK as part of deploying a fresh cluster or
>> are you re-using existing znodes? How did you configure replication in the
>> shell? Can you provide those commands? I'd request debug logs from
>> o.a.h.h.regionserver.Replication but I don't see much logging in there
>> anyway.
>>
>> Basically, can you repro this in a fresh deployment? As Himanshu points
>> out, I suspect stale configuration is hanging around.
>>
>>
>> On Thu, Oct 31, 2013 at 8:02 PM, Demai Ni <[email protected]> wrote:
>>
>> > Nick,
>> >
>> > Thanks for looking into this problem. I attached the hbase-site.xml to
>> > this email. Note that I had to tear down the cluster that the original
>> > log came from, so the hbase-site.xml is from another (single-node)
>> > cluster with the same problem.
>> >
>> > BTW, I did some investigation this afternoon and don't think this is a
>> > problem within the hbase code. (Background: I am working within a
>> > software team, and quite a few engineers change hbase, hadoop, and other
>> > code every day.) I tried several different installations and found that
>> > a week-old build with today's hbase build works just fine, but today's
>> > build with last week's hbase doesn't. Our build includes hadoop 2, which
>> > may introduce something problematic.
>> >
>> > I am wondering how hbase generates the UUID. Maybe that is something I
>> > should look into? Thanks
>> >
>> > Demai
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Oct 31, 2013 at 6:20 PM, Nick Dimiduk <[email protected]> wrote:
>> >
>> >> Can you post your replication settings from hbase-site.xml?
>> >>
>> >> On Thursday, October 31, 2013, Demai Ni wrote:
>> >>
>> >> > hi, folks,
>> >> >
>> >> > Something strange happened on my cluster (hbase 0.94.9) recently. I am
>> >> > setting up a new cluster for replication, and didn't see the data
>> >> > being replicated to the peer. Then I found the following in the log of
>> >> > the regionserver of the Master:
>> >> >
>> >> > 2013-10-31 13:33:03,293 INFO org.apache.hadoop.hbase.metrics: new
>> >> > MBeanInfo
>> >> > 2013-10-31 13:33:03,300 INFO
>> >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> >> > Getting 1 rs from peer cluster # 3
>> >> > 2013-10-31 13:33:03,300 INFO
>> >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> >> > Choosing peer hdtest018.svl.ibm.com,60020,1383251582072
>> >> > 2013-10-31 13:33:03,302 INFO
>> >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> >> > Replicating *b520de1d-3a18-4aec-bd45-de000e81417d* ->
>> >> > *b520de1d-3a18-4aec-bd45-de000e81417d*
>> >> >
>> >> > the log is from ReplicationSource:
>> >> > *LOG.info("Replicating "+clusterId + " -> " + peerClusterId);*
>> >> >
>> >> > It seems the problematic cluster is replicating to itself.
>> >> > Any suggestion about how to look into this problem? Many thanks
>> >> >
>> >> > BTW, I can replicate from another cluster to this problematic one.
>> >> >
>> >> > Demai
>> >> >
>> >>
>> >
>> >
>>
>
>
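
For the symptom in the original report, where the "Replicating X -> Y" line shows the same UUID on both sides, a guard like the following makes the condition explicit. This is only a sketch: SelfReplicationGuard and isSelfReplication are hypothetical names, not HBase code, and the UUIDs are the ones from the regionserver logs quoted in this thread.

```java
import java.util.UUID;

class SelfReplicationGuard {

    /** True when the source cluster and the peer cluster share one UUID. */
    static boolean isSelfReplication(UUID clusterId, UUID peerClusterId) {
        return clusterId.equals(peerClusterId);
    }

    public static void main(String[] args) {
        // hbase.id values taken from the logs quoted above.
        UUID bdvm134 = UUID.fromString("b13a0e3a-2bec-4e13-8b1d-043aa1a66443");
        UUID hdtest014 = UUID.fromString("ce41a00d-5b0c-44b2-8bf7-bfd35bda1d42");

        // What the bdvm134 regionserver logged: source and peer are identical.
        System.out.println("bdvm134 -> peer 6: self-replication = "
            + isSelfReplication(bdvm134, bdvm134));

        // What the hdtest014 regionserver logged: two distinct clusters.
        System.out.println("hdtest014 -> peer 1: self-replication = "
            + isSelfReplication(hdtest014, bdvm134));
    }
}
```

A check like this only detects the symptom. It does not explain why the peer's ZooKeeper watcher resolves to the local quorum.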
