The OP hasn't provided enough information to even start trying to make a real recommendation on how to solve this problem.
On Aug 4, 2012, at 7:32 AM, Nitin Kesarwani <bumble....@gmail.com> wrote:

> Given the size of the data, there are several possible approaches here:
>
> 1. Moving the boxes
>
> Not possible, as I suppose the data is needed for 24x7 analytics.
>
> 2. Mirroring the data
>
> This is a good solution. However, if you have data being written/removed
> continuously (as part of a live system), there is a chance of losing some
> of the data while the mirroring happens, unless
> a) you block writes/updates during that time (if you do so, that would be
> as good as unplugging and moving the machines around), or
> b) you keep track of what was modified after you started the mirroring
> process.
>
> I would recommend going with 2b) because it minimizes downtime. Here is
> how I think you can do it, using some of the tools provided by Hadoop
> itself.
>
> a) Use a fast distributed copying tool to copy large chunks of data.
> Before you kick this off, create a utility that tracks modifications made
> to your live system while the copy runs in the background. The utility
> logs the modifications into an audit trail.
> b) Once you're done copying the files, let the new data store catch up by
> replaying the modifications recorded in your utility's log. Once synced
> up, you can begin the short downtime window by switching off the
> JobTracker in the live cluster so that no new files are created.
> c) As soon as you finish the last chunk of copying, change the DNS entries
> so that the hostnames referenced by the Hadoop jobs point to the new
> location.
> d) Turn on the JobTracker for the new cluster.
> e) Enjoy a drink with the money you saved by not using paid third-party
> solutions, and pat yourself on the back! ;)
>
> The key to the above solution is to make the data copying of step a) as
> fast as possible. The less time it takes, the smaller the audit trail and
> the shorter the overall downtime.
>
> You can develop an in-house solution for this, or use DistCp, which ships
> with Hadoop and copies the data using Map/Reduce.
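To make steps a) and b) concrete, here is a minimal sketch of the catch-up pass, assuming Hadoop 0.20's FileSystem API and a hypothetical /data root. The bulk copy itself would be a plain `hadoop distcp` run (its -update flag copies only files that differ from what already exists at the destination); the paths this program prints are the "audit trail" to re-copy during the cutover window. Note this scans HDFS modification times after the fact rather than logging changes in real time, so it catches new and appended files but not deletes or renames, which would still need the tracking utility Nitin describes.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of the catch-up pass: walk the source HDFS tree and print every file
 * modified after the bulk copy started. The printed paths are the "audit
 * trail" to re-copy (e.g. with a second, much smaller distcp) while the
 * JobTracker is stopped. Root path and timestamp handling are illustrative.
 */
public class ModifiedSinceScanner {

    // Recursively collect files whose modification time is after cutoffMillis.
    static void scan(FileSystem fs, Path dir, long cutoffMillis, List<Path> out)
            throws IOException {
        for (FileStatus stat : fs.listStatus(dir)) {
            if (stat.isDir()) {
                scan(fs, stat.getPath(), cutoffMillis, out);
            } else if (stat.getModificationTime() > cutoffMillis) {
                out.add(stat.getPath());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long copyStartedAt = Long.parseLong(args[0]); // epoch millis when the bulk copy began
        FileSystem fs = FileSystem.get(new Configuration()); // source cluster's default FS

        List<Path> changed = new ArrayList<Path>();
        scan(fs, new Path("/data"), copyStartedAt, changed);

        for (Path p : changed) {
            System.out.println(p.toUri().getPath());
        }
    }
}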
> On Sat, Aug 4, 2012 at 3:27 AM, Michael Segel
> <michael_se...@hotmail.com> wrote:
>
>> Sorry, at 1PB of disk... compression isn't really going to help a whole
>> heck of a lot. Your networking bandwidth will be your bottleneck.
>>
>> So let's look at the problem.
>>
>> How much downtime can you afford?
>> What does your hardware look like?
>> How much space do you have in your current data center?
>>
>> You have 1PB of data. OK, what does the access pattern look like?
>>
>> There are a couple of ways to slice and dice this. How many trucks do
>> you have?
>>
>> On Aug 3, 2012, at 4:24 PM, Harit Himanshu <harit.subscripti...@gmail.com>
>> wrote:
>>
>>> Moving 1 PB of data will take loads of time.
>>> - Check whether the new data center provides something similar to
>>>   http://aws.amazon.com/importexport/
>>> - Consider multi-part uploading of the data
>>> - Consider compressing the data
>>>
>>> On Aug 3, 2012, at 2:19 PM, Patai Sangbutsarakum wrote:
>>>
>>>> Thanks for the response.
>>>> A physical move is not a choice in this case. I'm purely looking at
>>>> copying the data, and at how to catch up with updates to a file while
>>>> it is being migrated.
>>>>
>>>> On Fri, Aug 3, 2012 at 12:40 PM, Chen He <airb...@gmail.com> wrote:
>>>>> Sometimes, physically moving the hard drives helps. :)
>>>>>
>>>>> On Aug 3, 2012 1:50 PM, "Patai Sangbutsarakum" <silvianhad...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Hadoopers,
>>>>>>
>>>>>> We have a plan to migrate our Hadoop cluster to a different datacenter
>>>>>> where we can triple the size of the cluster.
>>>>>> Currently, our 0.20.2 cluster has around 1PB of data. We use only
>>>>>> Java/Pig.
>>>>>>
>>>>>> I would like to get some input on how to handle transferring 1PB of
>>>>>> data to the new site, and also how to keep up with the new files that
>>>>>> are thrown into the cluster all the time.
>>>>>>
>>>>>> Happy Friday!!
>>>>>>
>>>>>> P
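One more data point on Michael's remark above that network bandwidth will be the bottleneck: a back-of-the-envelope estimate puts the bulk copy on the order of weeks, which is exactly why keeping the catch-up window small matters. The 10 Gb/s link and ~50% effective utilization below are assumptions, not measurements.

public class TransferEstimate {
    public static void main(String[] args) {
        double bytes = 1e15;                    // 1 PB
        double linkBitsPerSec = 10e9 * 0.5;     // assumed 10 Gb/s link at ~50% effective throughput
        double seconds = bytes * 8 / linkBitsPerSec;
        System.out.printf("~%.1f days%n", seconds / 86400.0); // prints ~18.5 days
    }
}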