[ https://issues.apache.org/jira/browse/HBASE-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983372#action_12983372 ]

Daniel Einspanjer commented on HBASE-3451:
------------------------------------------

Diffing python script:
http://xstevens.pastebin.mozilla.org/956095
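
The pastebin link above may no longer resolve; for reference, here is a minimal
sketch of a listing-diff along those lines.  This is an assumption of what such
a script might look like, not the actual script: the namenode URIs, the /hbase
root, and the "hadoop fs -lsr" column parsing are all illustrative.

#!/usr/bin/env python
# Rough sketch (not the actual pastebin script): diff the recursive HDFS
# listings of the SJC source cluster and the PHX destination cluster and
# report files that are missing, or have a different size, on the destination.
import subprocess

SRC = "hdfs://sjc-namenode:8020"   # assumed source (0.20.6) namenode
DST = "hdfs://phx-namenode:8020"   # assumed destination (0.89) namenode
ROOT = "/hbase"

def listing(fs_uri, root):
    """Return {relative path: size} for every file under root on one cluster."""
    out = subprocess.check_output(
        ["hadoop", "fs", "-lsr", fs_uri + root]).decode("utf-8", "replace")
    files = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) < 8 or parts[0].startswith("d"):
            continue  # skip directories and any malformed lines
        size, path = int(parts[4]), parts[7]
        # normalize to a cluster-relative path so the two sides compare
        if path.startswith(fs_uri):
            path = path[len(fs_uri):]
        files[path] = size
    return files

def main():
    src, dst = listing(SRC, ROOT), listing(DST, ROOT)
    for path in sorted(src):
        if path not in dst:
            print("MISSING   %s" % path)
        elif dst[path] != src[path]:
            print("SIZE-DIFF %s src=%d dst=%d" % (path, src[path], dst[path]))

if __name__ == "__main__":
    main()

A size-only comparison will miss files whose contents changed without changing
length; comparing checksums would be stricter, at the cost of a much slower run.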

> Cluster migration best practices
> --------------------------------
>
>                 Key: HBASE-3451
>                 URL: https://issues.apache.org/jira/browse/HBASE-3451
>             Project: HBase
>          Issue Type: Brainstorming
>    Affects Versions: 0.20.6, 0.89.20100924
>            Reporter: Daniel Einspanjer
>            Priority: Critical
>
> Mozilla is currently in the process of trying to migrate our HBase cluster to 
> a new datacenter.
> We have our existing 25-node cluster in our SJC datacenter.  It is serving 
> production traffic 24/7.  While we can take downtime, it is very costly and 
> difficult to do so for more than a few hours in the evening.
> We have two new 30-node clusters in our PHX datacenter.  We want to cut 
> production over to one of these this week.
> The old cluster is running 0.20.6.  The new clusters are running CDH3b3 with 
> HBase 0.89.
> We have tried running a pull distcp using hftp URLs.  If HBase is running, 
> this causes SAX XML Parsing exceptions when a directory is removed during the 
> scan.
> If HBase is stopped, it takes hours for the directory compare to finish 
> before it even begins copying data.
> We have tried a custom backup MR job.  This job uses the map phase to 
> evaluate and copy changed files. It can run while HBase is live, but that 
> results in a dirty copy of the data.
> We have also tried running the custom backup job while HBase is shut down. 
> Even on two back-to-back runs, it still copies over some data, so the copy 
> does not seem to be entirely clean.
> When we had what we thought was a complete copy on the new cluster, we ran 
> add_table on it, but the resulting HBase table had holes.  Investigating the 
> holes revealed directories that had not been transferred.
> We had a meeting to brainstorm ideas, and three further suggestions came up:
> 1. Build a list of the files to transfer on the SJC side, transfer that file 
> list to PHX, and then run distcp on it (see the sketch below).
> 2. Try a full copy instead of an incremental one, skipping the expensive 
> file compare step.
> 3. Evaluate copying from SJC to S3 and then from S3 to PHX.
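
For suggestion 1, a minimal sketch of how such a file list could be fed to
distcp.  The hostnames, ports, and staging paths here are illustrative
assumptions; -f, -i, and -log are standard distcp options, and pulling over
hftp lets the newer cluster read the 0.20.6 cluster without matching RPC
versions.

#!/usr/bin/env python
# Sketch for suggestion 1: build a source list on the SJC side and feed it to
# distcp's -f (source-list) option, running distcp on the destination cluster
# and pulling over hftp.  Hostnames, ports, and paths are assumptions.
import subprocess

SRC_HFTP = "hftp://sjc-namenode:50070"     # read-only HTTP view of the source
DST_HDFS = "hdfs://phx-namenode:8020"      # destination cluster
LIST_URI = DST_HDFS + "/tmp/sjc_hbase_srclist.txt"

def stage_source_list(dirs, local_file="sjc_hbase_srclist.txt"):
    """Write one fully-qualified hftp URI per line and stage it on PHX HDFS.

    Listing directories (e.g. the per-table dirs under /hbase) rather than
    individual files keeps each subtree intact, since distcp places every
    listed source under the destination by its last path component.
    """
    with open(local_file, "w") as fh:
        for d in dirs:
            fh.write(SRC_HFTP + d + "\n")
    subprocess.check_call(["hadoop", "fs", "-put", local_file, LIST_URI])

def run_distcp():
    # Run on the destination cluster: -i ignores per-file failures so one bad
    # file does not kill the whole job; -log keeps a record for re-runs.
    subprocess.check_call(
        ["hadoop", "distcp", "-i",
         "-log", DST_HDFS + "/tmp/distcp_logs",
         "-f", LIST_URI,
         DST_HDFS + "/hbase"])

if __name__ == "__main__":
    stage_source_list(["/hbase/mytable"])  # hypothetical table directory
    run_distcp()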

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
