At TripAdvisor we use Hadoop and Hive for our warehousing needs.  Processing 
the daily logs takes a long time, and re-processing them would be 
prohibitively expensive.  As we couldn't find an existing backup solution, we 
put one together and open-sourced it in the hope that it's useful to others 
as well.  You can find it on GitHub:

https://github.com/TAwarehouse/backup-hadoop-and-hive

The backup app traverses the HDFS filesystem looking for all files with an 
mtime in a given range, then copies them (à la copyToLocal) to a local 
directory.  If HDFS were to crash, you could use "hadoop fs -copyFromLocal" 
to restore the filesystem contents.  The backup can be invoked incrementally 
to keep the local copy up to date.  Files that would be overwritten are first 
copied to a "preserved" area, so that older versions remain available.

The project also includes a dump of the Hive schema, along with HQL 
statements to reassociate the tables with their HDFS partitions.  This 
portion came in very handy when we migrated our Hive backing database from 
Derby to MySQL.
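
To make "reassociate" concrete, the generated statements are roughly of this 
shape (the table name, partition column, and location below are made-up 
examples, not output from the tool):

    -- re-attach a partition to its existing hdfs directory
    ALTER TABLE daily_logs ADD IF NOT EXISTS
      PARTITION (ds='2011-06-01')
      LOCATION '/user/hive/warehouse/daily_logs/ds=2011-06-01';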

Thanks go to Josh Patterson, Edward Capriolo, and Rapleaf for letting us use 
their HDFS-style checksum, Hive show-create-table, and HDFS traversal code.

For more info see the README:

https://github.com/TAwarehouse/backup-hadoop-and-hive/blob/master/README.txt


tom.
tpalka<at>tripadvisor<dot>com
