Hello list,

I have to back up a huge, changing and growing collection of files (~15 million / ~6TB) and the existing rsync solution just doesn't cut it any more. Below I will explain the situation, what the problem is, and my current train of thought. Where you people can help is by being a sounding board, and hopefully providing some out-of-the-box thinking :D.
Current situation:
- We are a community website providing photo/video sharing as part of the overall package.
- Our collection of videos and photos is surpassing 15.3 million files.
- Growth rate is estimated at 50,000 files a day, a rate which of course grows itself over time.
- The collection is saved on a 9TB system.
- The backups are two off-site 4TB systems; the collection needs to be split over these.
- Our backup window is the whole day, as long as this does not cause a performance drain. In reality that means we need to use the quiet night hours, 0:00 to 8:00.
- The collection is stored in a set of subdirectories, each containing 50,000 files (1-50000, 50001-100000, etc.). There are ~300 subdirs in use now.
- Files are never deleted.
- In the future files may change. My expectation is that at most a few thousand files a day will change, scattered over the whole collection with an emphasis on the most recent files.
- We only need a mirrored version of the collection, no weekly or incremental snapshots; just a copy of the collection.
- It takes rsync 30 minutes to determine that a directory needs no updating; it takes longer if it does.
- The collection cannot be taken offline during backups; it needs to be accessible all the time.

What we use now is an rsync over only the last directories, which works purely by the grace of the files currently being immutable. That will change, though.

Train of thought:

Remembering that (in the near future) changes can happen anywhere in the collection, we will need to compare the whole collection to its backed-up/mirrored counterpart. Figuring that running rsync over 300 subdirectories will take at least 150 hours makes it not a viable option. We can assume that the rsync people are smart and that there is no substantially faster way to compare such huge collections.

Another option is just copying the whole collection every night. Assuming a sustained transfer rate of 50MB/sec (20 is more likely I think, but I have no experience with it), it would take about 33 hours. This method also generates a huge disk load, which is unacceptable, especially for more than a few hours.

So we can neither construct a list of differences within an acceptable time, nor copy all the files. Combining these two, I got the following brainstorm: it is inefficient to reconstruct at night the changes that were made during the day, so why not save this knowledge when it is already available (while editing/saving the files)? Saving all the changed filenames in a queue/list allows us to read this list at the end of the day and only copy the needed files, at most 60,000 (new + changed). It is also possible to make this a daemon process, keeping the backup up to date almost in real time with relatively little load; you could build in rate limits to decrease the performance drain during peak hours. This could just work :D

The only problem is constructing the list and capturing the knowledge while it is available. Two options exist:
- At system level this can be done using, for example, inotify [1]; this requires a user-space daemon. If the daemon crashes, changes will be missed though. (A rough sketch is appended below my signature.)
- At application level (in the application making the changes) this can also be done; when the application crashes no changes are made, so nothing is missed. But it does make the backup dependent on the application, which is not an ideal situation. (Also sketched below.)

Do you have any ideas/hints? Maybe options I have not seen, even if they sound silly; they might trigger a fresh idea :D

[1] http://www-128.ibm.com/developerworks/linux/library/l-inotify.html

with regards,

Jos Houtman
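
P.S. To make the system-level idea a bit more concrete, here is a rough, untested sketch of an inotify watcher in C. Everything in it is illustrative: the directory and list-file paths are made up, it only watches a single subdirectory, and a real daemon would need to add watches for all ~300 subdirs and survive restarts.

    /* watchlist.c -- rough sketch of the system-level (inotify) variant.
     * Watches one of the 50,000-file subdirectories and appends the name
     * of every file written or moved into it to a list file that the
     * nightly copy job can consume.  All paths are made up. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/inotify.h>

    #define BUF_LEN (64 * (sizeof(struct inotify_event) + 256))

    int main(void)
    {
        int fd = inotify_init();
        if (fd < 0) { perror("inotify_init"); return 1; }

        /* IN_CLOSE_WRITE catches new uploads and edits, IN_MOVED_TO catches
         * files renamed into place after a temporary upload. */
        if (inotify_add_watch(fd, "/data/collection/0000001-0050000",
                              IN_CLOSE_WRITE | IN_MOVED_TO) < 0) {
            perror("inotify_add_watch");
            return 1;
        }

        FILE *list = fopen("/var/spool/backup/changed.list", "a");
        if (!list) { perror("fopen"); return 1; }

        char buf[BUF_LEN];
        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            if (len <= 0)
                break;
            for (char *p = buf; p < buf + len; ) {
                struct inotify_event *ev = (struct inotify_event *) p;
                if (ev->len > 0) {
                    /* store paths relative to the collection root;
                     * duplicates are harmless, rsync just re-checks them */
                    fprintf(list, "0000001-0050000/%s\n", ev->name);
                    fflush(list);
                }
                p += sizeof(struct inotify_event) + ev->len;
            }
        }
        fclose(list);
        return 0;
    }

The nightly job could then run something like rsync -a --files-from=/var/spool/backup/changed.list /data/collection/ backuphost:/backup/collection/ and truncate the list afterwards (paths again made up).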

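The application-level variant boils down to nothing more than the upload/edit code appending the path to the same list right after it saves a file. Sketched in C for symmetry; the function name and paths are made up, and in our case this would really live inside the web application:

    /* sketch of the application-level variant: call this right after the
     * application has saved or overwritten a file. */
    #include <stdio.h>

    int log_changed_file(const char *relpath)
    {
        FILE *list = fopen("/var/spool/backup/changed.list", "a");
        if (!list)
            return -1;
        fprintf(list, "%s\n", relpath);
        /* open/append/close per file is slower, but nothing is lost if
         * the process dies right after the save */
        return fclose(list);
    }

Since both variants produce the same changed.list format, the nightly rsync --files-from job would not care which of the two fed it.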