1.8TB in a day is not terribly slow if that number comes from the CopyTable counters and you are moving data across data centers over public networks; that works out to roughly 20MB/sec. Also, CopyTable doesn't compress anything on the wire, so the network overhead can be significant. If you use something like Snappy for block compression and/or FAST_DIFF for block encoding on the HFiles, then taking snapshots and exporting them with the ExportSnapshot tool should be the way to go, since that ships the already-compressed HFiles directly instead of re-sending uncompressed KeyValues.
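For reference, a rough sketch of that flow (the table, snapshot, destination NameNode and mapper count below are just placeholders; point -copy-to at the destination cluster's hbase.rootdir and size -mappers for your clusters):

    # on the source cluster, from the hbase shell
    snapshot 'table_name', 'table_name_snap'

    # ship the snapshot's HFiles to the destination cluster with a MapReduce job
    bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot table_name_snap \
      -copy-to hdfs://dest-namenode:8020/hbase \
      -mappers 16

    # on the destination cluster, from the hbase shell
    clone_snapshot 'table_name_snap', 'table_name'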
cheers,
esteban.

--
Cloudera, Inc.

On Thu, Aug 14, 2014 at 11:24 PM, tobe <[email protected]> wrote:

> Thanks @lars.
>
> We're using HBase 0.94.11 and followed the instructions to run `./bin/hbase
> org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=hbase://cluster_name
> table_name`. We have a namespace service to find the ZooKeeper quorum with
> "hbase://cluster_name". And the job ran on a shared YARN cluster.
>
> The performance is affected by many factors, but we haven't found out the
> reason. It would be great to see your suggestions.
>
>
> On Fri, Aug 15, 2014 at 1:34 PM, lars hofhansl <[email protected]> wrote:
>
> > What version of HBase? How are you running CopyTable? A day for 1.8T is
> > not what we would expect.
> > You can definitely take a snapshot and then export the snapshot to another
> > cluster, which will move the actual files; but CopyTable should not be so
> > slow.
> >
> >
> > -- Lars
> >
> >
> > ________________________________
> > From: tobe <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Cc: [email protected]
> > Sent: Thursday, August 14, 2014 8:18 PM
> > Subject: A better way to migrate the whole cluster?
> >
> >
> > Sometimes our users want to upgrade their servers or move to a new
> > datacenter, and then we have to migrate the data from HBase. Currently we
> > enable replication from the old cluster to the new cluster, and run
> > CopyTable to move the older data.
> >
> > It's a little inefficient. It takes more than one day to migrate 1.8T of data
> > and more time to verify. Can we have a better way to do that, like snapshots
> > or purely HDFS files?
> >
> > And what's the best practice, or your valuable experience?
> >
