Hi,

I was just going to ask this on hadoop list, but luckily I checked this one
first.

I've also been searching the net for HDFS backup solutions, but there
isn't much information available.
So I'd dare say it hasn't been asked a myriad of times. ;)

I found this question (basically the same one I'm asking now):
http://www.quora.com/Whats-the-right-way-to-backup-Hadoop
It has 4 suggested solutions. Of those:
1. hdfs + fuse -> iirc there might be some scaling problems, and you still
need to copy that data somewhere
2. flume (or similar) -> at least flume isn't reliable enough, judging by
our testing and use of it for collecting some logs to hadoop
3. high degree of replication in hdfs -> isn't actually a backup
4. back up hdfs every hour/day/other interval to a locally mounted fs:
http://blog.rapleaf.com/dev/2009/06/05/backing-up-hadoops-hdfs/

In addition to the above:
5. apparently some people use distcp, but some (the same people or others)
claim that it is unreliable. It was mentioned here as well.
6. Then there is also Mozilla's alternative to distcp:
http://blog.mozilla.com/data/2011/02/04/migrating-hbase-in-the-trenches/

So, for you (and maybe for us as well), approach 4 might be the most
feasible option.
I quickly tested the Backup.java from 4, and it seemed to fetch the data.
However, I haven't done any proper tests yet, and I have no idea how
reliable or performant it is. It may at least have scaling problems.
It might still be worth checking out.
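
To make the idea behind approach 4 a bit more concrete, here is a rough
Python sketch of a scheduled pull from HDFS to a locally mounted filesystem,
with old copies pruned after a retention window such as the 7, 14, or 45
days mentioned in the original question. This is not the Rapleaf
Backup.java, just an illustration under some assumptions: the `hadoop`
command is on the PATH, each run copies into a directory named after the
date, and all function names and paths are made up for the example. It has
the same caveats about scaling and reliability as the approach itself.

```python
import datetime as dt
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def snapshot_dir(backup_root: Path, day: dt.date) -> Path:
    """Directory for one day's copy, e.g. <backup_root>/2012-01-04."""
    return backup_root / day.isoformat()


def copy_command(hdfs_path: str, target: Path) -> List[str]:
    """Build the `hadoop fs -copyToLocal` call that pulls hdfs_path to target."""
    return ["hadoop", "fs", "-copyToLocal", hdfs_path, str(target)]


def expired_snapshots(backup_root: Path, today: dt.date,
                      retention_days: int) -> List[Path]:
    """Return the dated directories that fall outside the retention window."""
    expired = []
    for child in sorted(backup_root.iterdir()):
        if not child.is_dir():
            continue
        try:
            day = dt.date.fromisoformat(child.name)
        except ValueError:
            continue  # not one of our dated snapshot directories
        if (today - day).days >= retention_days:
            expired.append(child)
    return expired


def run_backup(hdfs_path: str, backup_root: Path, retention_days: int,
               today: Optional[dt.date] = None) -> None:
    """Copy hdfs_path into today's directory, then prune expired copies."""
    today = today or dt.date.today()
    backup_root.mkdir(parents=True, exist_ok=True)
    target = snapshot_dir(backup_root, today)
    # check=True raises if the copy fails, so pruning only happens
    # after a successful copy.
    subprocess.run(copy_command(hdfs_path, target), check=True)
    for old in expired_snapshots(backup_root, today, retention_days):
        shutil.rmtree(old)
```

Run daily from cron, something like run_backup("/data",
Path("/mnt/backup/hdfs"), 14) would keep the last 14 daily copies (both
paths are hypothetical). Pruning only after a successful copy is a
deliberate choice, so that a failed run never reduces the number of usable
backups.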

Regards,
Ossi


On Tue, Jan 3, 2012 at 10:53 PM, Mac Noland <mcdonaldnol...@yahoo.com> wrote:

> Good day,
>
> I’m guessing this question has been asked a myriad of times, but we’re
> about to get serious with some of our Hadoop implementations, so I wanted
> to re-ask to see if I’m missing anything, or if others happen to know if
> this might be on a future road map.
>
> For our current storage offerings (e.g. NAS or SAN), we give businesses
> the opportunity to choose 7, 14, or 45 day “backups” for their storage.
> The purpose of the backup isn’t so much that they are worried about losing
> their current data (we’re RAID’ed and have some stuff mirrored to remote
> datacenters), but more so that if they were to delete some data today,
> they can recover it from yesterday’s backup. Or the day before’s backup,
> or the day before that, etc. And to be honest, business units buy a good
> portion of their backups to make people feel better and fulfill custom
> contracts.
>
>
> So far with HDFS we haven’t found too many formalized offerings for this
> specific feature. While I haven’t done a ton of research, the best
> solution I’ve found is an idea where we’d schedule a job to pull the data
> locally to a mount that is backed up via our traditional methods. See
> Michael Segel’s first post on this site:
> http://lucene.472066.n3.nabble.com/Backing-up-HDFS-td1019184.html
>
> Though we’d have to work through the details of what this would look like
> for our support folks, it looks like something that could potentially fit
> into our current model. We’d basically need to allocate the same amount
> of SAN or NAS disk as we have for HDFS, then coordinate a snap on the SAN
> or NAS via our traditional methods. Not sure what a restore would look
> like, other than we could give the end users read access to the NAS or SAN
> mounts so they can pick through what they need to recover and let them
> figure out how to get it back into HDFS.
>
> For use cases like ours where we’d need multi-day backups to fulfill
> business needs, is this kind of what people are thinking or doing?
> Moreover, are there any things in the Hadoop HDFS road map for providing,
> for lack of a better word, an “enterprise” backup/restore solution?
>
> Thanks in advance,
>
> Mac Noland – Thomson Reuters
>
