[ https://issues.apache.org/jira/browse/HBASE-12814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278315#comment-14278315 ]

Andrew Purtell commented on HBASE-12814:
----------------------------------------

The Thrift-based replication protocol option is a very nice enhancement. We 
will have to bring it into all branches, not just 0.98, to avoid a post-0.98 
functional regression. 

I tried to apply the 0.98 patch to the head of the 0.98 branch but there were a 
few rejects: 1 hunk in ReplicationPeer, 1 in ZooKeeperProtos, 2 in 
HBaseInterClusterReplicationEndpoint, 3 in ReplicationSink, 1 in 
TestReplicationAdmin, and 2 in pom.xml. I think a quick rebase would fix that.

All other attempts at version-independent data storage or transfer in 0.98+ 
employ protobufs, not Thrift. The Thrift gateway is an exception, but it's an 
access gateway, not a core component. I don't have a strong opinion about it, 
especially given the amount of work already invested here, but I'm curious why 
it was necessary to use Thrift and to now have Thrift as a dependency of 
hbase-server.

Now we have embedded Thrift servers in regionservers again, with corresponding 
new service endpoints, thread pools, and metrics. The Thrift server even does 
its own slow op detection. Is this essentially a reintroduction of Thrift as an 
alternate RPC stack in the regionservers, like the Facebook work previously 
removed? Are we ok with that?

If we only want to run the THRIFT protocol during a zero-downtime upgrade, does 
putting all of this code and the Thrift dependency into hbase-server make 
sense? Or can we instead make this a pluggable replication endpoint 
implementation in its own module? Or do you think we might want to make much 
more use of the THRIFT replication protocol than that? If so, I suppose we need 
to characterize the performance difference and mark it experimental until there 
is sufficient experience with it in production.
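
To make the packaging question concrete, here is roughly what I have in mind by 
a pluggable endpoint: the Thrift transport shipped behind the existing 
ReplicationEndpoint API in its own module, keeping the Thrift dependency out of 
hbase-server. This is only a sketch; ThriftPeerClient and its methods are 
hypothetical placeholders, not anything in the patch:
{code}
import java.io.IOException;

import org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint;

// Sketch only: the Thrift transport behind the pluggable endpoint API.
// ThriftPeerClient (and everything on it) is a hypothetical placeholder.
public class ThriftReplicationEndpoint extends HBaseReplicationEndpoint {
  @Override
  public boolean replicate(ReplicateContext replicateContext) {
    try (ThriftPeerClient client = ThriftPeerClient.connect(ctx.getPeerConfig())) {
      // Ship the batch of WAL entries to the peer over the Thrift protocol.
      client.replicateEntries(replicateContext.getEntries());
      return true;   // batch shipped; the source advances its log position
    } catch (IOException e) {
      return false;  // the source retries the same batch (replication is idempotent)
    }
  }
}
{code}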

We'd have separate metrics for replication and for the Thrift service? 
Shouldn't they be combined, since the Thrift service only supports 
replication? 

Core replication code now has to implement Thrift interfaces? Will we have 
issues if we change Thrift versions? 
{code}
public class ReplicationSink implements THBaseService.Iface {
{code}
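
If the coupling stays, could we at least contain it behind a thin adapter, so 
core replication code never references generated Thrift types and a Thrift 
version change is localized to one class? A minimal sketch of what I mean, 
where SinkCore and applyPuts are hypothetical seams (ThriftUtilities.putsFromThrift 
is the existing converter in the thrift2 code):
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.List;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.thrift2.ThriftUtilities;
import org.apache.hadoop.hbase.thrift2.generated.TPut;

// Sketch: only this adapter touches generated Thrift types, so a Thrift
// version change is contained here instead of inside core replication code.
public class ThriftSinkAdapter {
  /** Hypothetical Thrift-free seam into the core sink. */
  interface SinkCore {
    void applyPuts(ByteBuffer table, List<Put> puts) throws IOException;
  }

  private final SinkCore sink;

  public ThriftSinkAdapter(SinkCore sink) {
    this.sink = sink;
  }

  // Convert at the edge and delegate; the core never sees TPut.
  public void putMultiple(ByteBuffer table, List<TPut> tputs) throws IOException {
    sink.applyPuts(table, ThriftUtilities.putsFromThrift(tputs));
  }
}
{code}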

In ReplicationAdmin this patch adds a method that is deprecated on arrival? Did 
you mean to deprecate an existing one instead?
{code}
@@ -155,6 +156,16 @@ public class ReplicationAdmin implements Closeable {
       new ReplicationPeerConfig().setClusterKey(clusterKey), tableCFs);
   }
 
+  @Deprecated
+  public void addPeer(String id, String clusterKey, String tableCFs, String protocol)
+      throws ReplicationException {
+    ReplicationPeerConfig config = new ReplicationPeerConfig().setClusterKey(clusterKey);
+    if (StringUtils.isNotBlank(protocol)) {
+      config = config.setProtocol(ReplicationPeer.PeerProtocol.valueOf(protocol));
+    }
+    this.replicationPeers.addPeer(id, config, tableCFs);
+  }
+
   /**
    * Add a new remote slave cluster for replication.
    * @param id a short name that identifies the cluster
{code}
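
If the intent was the usual migration pattern, I'd expect the @Deprecated on 
the existing overload instead, steering callers toward the new protocol-aware 
one. A sketch of what I'd have expected, reusing the body shown in the diff 
context above:
{code}
/**
 * @deprecated Use the protocol-aware addPeer overload instead.
 */
@Deprecated
public void addPeer(String id, String clusterKey, String tableCFs)
    throws ReplicationException {
  this.replicationPeers.addPeer(id,
      new ReplicationPeerConfig().setClusterKey(clusterKey), tableCFs);
}
{code}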

I just did a skim. Will take another pass after further discussion.

> Zero downtime upgrade from 94 to 98 
> ------------------------------------
>
>                 Key: HBASE-12814
>                 URL: https://issues.apache.org/jira/browse/HBASE-12814
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 0.94.26, 0.98.10
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HBASE-12814-0.94.patch, HBASE-12814-0.98.patch
>
>
> Here at Flurry we want to upgrade our HBase cluster from 94 to 98 without any 
> downtime, maintaining master/master replication throughout. 
> Summary:
> Replication is done via Thrift RPC between clusters.  It is configurable on a 
> peer-by-peer basis; the one caveat is that a Thrift server starts up on 
> every node, which proxies requests to the ReplicationSink.  
> For the upgrade process:
> * in hbase-site.xml two new configuration parameters are added (see the 
> example snippet after this list):
> ** *Required*
> *** hbase.replication.sink.enable.thrift -> true
> *** hbase.replication.thrift.server.port -> <thrift_server_port>
> ** *Optional*
> *** hbase.replication.thrift.protection {default: AUTHENTICATION}
> *** hbase.replication.thrift.framed {default: false}
> *** hbase.replication.thrift.compact {default: true}
> - All regionservers can be rolling restarted (no downtime); all clusters must 
> have the respective patch for this to work.
> - the hbase shell add_peer command takes an additional parameter for the RPC 
> protocol
> - example: {code} add_peer '1', "hbase-101:2181:/hbase", "THRIFT" {code}
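> As a concrete illustration, the required parameters listed above would look 
> like this in hbase-site.xml (example values only; the port is arbitrary):
> {code}
> <property>
>   <name>hbase.replication.sink.enable.thrift</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hbase.replication.thrift.server.port</name>
>   <value>9091</value>
> </property>
> {code}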
> Now comes the fun part: when you want to upgrade your cluster from 94 to 98, 
> you simply pause replication to the cluster being upgraded, do the upgrade, 
> and un-pause replication.  Once you have a pair of clusters replicating 
> inbound and outbound exclusively with the 98 release, you can start 
> replicating via the native RPC protocol by adding the peer again without the 
> _THRIFT_ parameter and subsequently deleting the peer with the Thrift 
> protocol.  Because replication is idempotent, I don't see any issues as long 
> as you wait for the backlog to drain after un-pausing replication. 
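> For example, the switch-over in the shell might look like this (peer ids are 
> arbitrary):
> {code}
> # add a native-protocol peer to the now-upgraded cluster
> add_peer '2', "hbase-101:2181:/hbase"
> # once the THRIFT peer's backlog has drained, remove it
> remove_peer '1'
> {code}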
> Special thanks to Francis Liu at Yahoo for laying the groundwork and Mr. Dave 
> Latham for his invaluable knowledge and assistance.  


