Hello list,

I have to back up a huge, changing and growing collection of files (~15 million / ~6TB) and the existing rsync solution just doesn't cut it any more. Below I will explain the situation, what the problem is, and my current train of thought. Where you people can help is by being a sounding board, and hopefully providing some out-of-the-box thinking :D.
Current situation:
- We are a community website providing photo/video sharing as part of the overall package.
- Our collection of videos and photos is surpassing 15.3 million files.
- Growth rate is estimated at 50,000 files a day, a rate which of course grows itself over time.
- The collection is saved on a 9TB system.
- The backups are two off-site 4TB systems; the collection needs to be split over these.
- Our backup window is the whole day, as long as this does not cause a performance drain. In reality that means we need to use the quiet night hours, 0:00 to 8:00.
- The collection is stored in a set of subdirectories, each containing 50,000 files (1-50000, 50001-100000, etc.). There are ~300 subdirs in use now.
- Files are never deleted.
- In the future files may change. My expectation is that at most a few thousand files a day will change, scattered over the whole collection with an emphasis on the most recent files.
- We only need a mirrored version of the collection, no weekly or incremental snapshots; just a copy of the collection.
- It takes rsync 30 minutes to determine that a directory needs no updating; it takes longer if it does.
- The collection cannot be taken offline during backups; it needs to be accessible all the time.

What we use now is an rsync over only the last directories, which works purely by the grace of the files currently being immutable. That will change, though.

Train of thought:

Remembering that (in the near future) changes can happen anywhere in the collection, we will need to compare the whole collection to its backed-up/mirrored counterpart. Figuring that running rsync over 300 subdirectories will take at least 150 hours makes it not a viable option. We can assume that the rsync people are smart and that there is no substantially faster way to compare such huge collections.

Another option is just copying the whole collection every night. Assuming a sustained transfer rate of 50MB/sec (20 is more likely I think, but I have no experience with it), it would take about 33 hours. This method also generates a huge disk load, which is unacceptable, especially for more than a few hours.

So we can neither construct a list of differences within an acceptable time, nor copy all the files. Combining these two, I got the following brainstorm: it is inefficient to reconstruct at night the changes that were made during the day, so why not save this knowledge when it is already available (while editing/saving the files)? Saving all the changed filenames in a queue/list allows us to read this list at the end of the day and only copy the needed files, at most 60,000 (new + changed). It is also possible to make this a daemon process, keeping the backup up to date almost in real time with relatively little load; you could build in rate limits to decrease the performance drain during peak hours. This could just work :D

The only problem is constructing the list and capturing the knowledge while it is available. Two options exist:
- At system level this can be done using, for example, inotify [1]; this requires a user-space daemon. If the daemon crashes, changes will be missed though. (A rough sketch is appended below my signature.)
- At application level (in the application making the changes) this can also be done; when the application crashes no changes are made, so nothing is missed. But it does make the backup dependent on the application, which is not an ideal situation. (Also sketched below.)

Do you have any ideas/hints? Maybe options I have not seen, even if they sound silly; they might trigger a fresh idea :D

[1] http://www-128.ibm.com/developerworks/linux/library/l-inotify.html

with regards,

Jos Houtman
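
P.S. To make the system-level idea a bit more concrete, here is a rough, untested sketch of an inotify watcher in C. Everything in it is illustrative: the directory and list-file paths are made up, it only watches a single subdirectory, and a real daemon would need to add watches for all ~300 subdirs and survive restarts.

    /* watchlist.c -- rough sketch of the system-level (inotify) variant.
     * Watches one of the 50,000-file subdirectories and appends the name
     * of every file written or moved into it to a list file that the
     * nightly copy job can consume.  All paths are made up. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/inotify.h>

    #define BUF_LEN (64 * (sizeof(struct inotify_event) + 256))

    int main(void)
    {
        int fd = inotify_init();
        if (fd < 0) { perror("inotify_init"); return 1; }

        /* IN_CLOSE_WRITE catches new uploads and edits, IN_MOVED_TO catches
         * files renamed into place after a temporary upload. */
        if (inotify_add_watch(fd, "/data/collection/0000001-0050000",
                              IN_CLOSE_WRITE | IN_MOVED_TO) < 0) {
            perror("inotify_add_watch");
            return 1;
        }

        FILE *list = fopen("/var/spool/backup/changed.list", "a");
        if (!list) { perror("fopen"); return 1; }

        char buf[BUF_LEN];
        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            if (len <= 0)
                break;
            for (char *p = buf; p < buf + len; ) {
                struct inotify_event *ev = (struct inotify_event *) p;
                if (ev->len > 0) {
                    /* store paths relative to the collection root;
                     * duplicates are harmless, rsync just re-checks them */
                    fprintf(list, "0000001-0050000/%s\n", ev->name);
                    fflush(list);
                }
                p += sizeof(struct inotify_event) + ev->len;
            }
        }
        fclose(list);
        return 0;
    }

The nightly job could then run something like rsync -a --files-from=/var/spool/backup/changed.list /data/collection/ backuphost:/backup/collection/ and truncate the list afterwards (paths again made up).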

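The application-level variant boils down to nothing more than the upload/edit code appending the path to the same list right after it saves a file. Sketched in C for symmetry; the function name and paths are made up, and in our case this would really live inside the web application:

    /* sketch of the application-level variant: call this right after the
     * application has saved or overwritten a file. */
    #include <stdio.h>

    int log_changed_file(const char *relpath)
    {
        FILE *list = fopen("/var/spool/backup/changed.list", "a");
        if (!list)
            return -1;
        fprintf(list, "%s\n", relpath);
        /* open/append/close per file is slower, but nothing is lost if
         * the process dies right after the save */
        return fclose(list);
    }

Since both variants produce the same changed.list format, the nightly rsync --files-from job would not care which of the two fed it.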