Thank you all for the replies; let me answer them in one reply.
Björn Gustafsson wrote:
Now what kind of systems are these? Home-grown arrays or "real" ones?
In the latter case, are there no vendor-provided approaches to this?
These are what you would call homegrown. We are now looking at more
serious systems, but they also cost serious money.
I'm not sure how this would apply to regular filesystems (no idea
which one you use though), but in "larger" (not size-wise) systems, a
bitmap of the filesystem is kept in a separate location, and
disk areas with changed or added files are marked as dirty and
transferred to the remote host either immediately (with synchronous
i/o), as soon as possible (async i/o), or when requested (veeeery
async i/o ;)). This is a rather effective system, with the backup speed
mainly dependent on the size you choose for the bitmap (large
bitmap => smaller blocks => potentially less data) and transfer
speed. Restructuring of data on the physical disk would also create a
major update of blocks to be transferred.
Hmm, I like that. Is it possible to filter the things that are
transferred? For example, I don't want to mirror deletions.
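
To check that I understand the idea, here is a rough sketch of such a
dirty-block bitmap (Python, purely illustrative; the block size and
function names are made up, and this is not how any particular
filesystem actually implements it):

BLOCK_SIZE = 64 * 1024   # smaller blocks => bigger bitmap => less data per sync

class DirtyBitmap:
    def __init__(self, device_size):
        nblocks = (device_size + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.dirty = bytearray(nblocks)   # one flag per block, kept separate from the data itself

    def mark_write(self, offset, length):
        # to be called for every write to the disk area
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for block in range(first, last + 1):
            self.dirty[block] = 1

    def dirty_blocks(self):
        return [b for b, flag in enumerate(self.dirty) if flag]

def sync_to_remote(bitmap, read_block, send_block):
    # the "when requested" (veeeery async) variant: walk the bitmap at backup time
    for block in bitmap.dirty_blocks():
        send_block(block, read_block(block))   # only changed areas go over the wire
        bitmap.dirty[block] = 0

I assume filtering out deletions would then be a decision made in the
transfer step rather than in the bitmap itself, but correct me if I'm wrong.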
Marking folders as dirty is another solution, however 50k files is a
bit big. Implementing dirty files in chunks of say 50 or 100 would be
a half-way solution, but that'd be dependent on the application [see
below].
Another nice suggestion: if the list of individually edited files gets
too big, we can indeed start working with groups.
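
Roughly what I picture for the groups (again just a sketch; the group
size and names are invented, and it assumes our files are numbered):

GROUP_SIZE = 100   # e.g. photos 11809300-11809399 form one group

dirty_groups = set()

def mark_file_dirty(photo_id):
    # the application only has to remember which group was touched
    dirty_groups.add(photo_id // GROUP_SIZE)

def ids_to_backup():
    # expand every dirty group back into the individual photo ids to re-copy
    for group in sorted(dirty_groups):
        start = group * GROUP_SIZE
        for photo_id in range(start, start + GROUP_SIZE):
            yield photo_id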
From Alex Efros:
In your case that means, for example: probably the best solution to the
backup issue is to change the way files are changed, so that changed files
aren't really CHANGED, but instead a new version is just ADDED to the collection.
This way it is enough for you to just remember which file was
backed up last by the previous backup and, on the next backup, continue from that
file (I suppose all your files are numbered: "(1-50000, 50001-100000, etc)").
This way the backup will not depend on the collection size (only on the amount of
added files) and will not depend on some "special feature" in the application
(like constructing a list of changed files) which may have bugs.
I really like this solution. It has several advantages:
- it's really simple
- it requires no interaction with the application.
- it adds a little overhead in disk space, but that's probably negligible
compared to the 20-30% of pictures we don't need but just don't delete. We
don't want to take the risk of deleting the wrong files, and of course, if
the service is ever misused, we still have evidence.
- no need to construct a list, just a last-backed-up-photo-id somewhere
(see the sketch below).
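
Something along these lines is really all the backup script would need
(a sketch only; the paths, the state-file location, and the one-file-per-id
layout are made up for illustration):

import os, shutil

STATE_FILE = "/var/backup/last_photo_id"   # hypothetical location
SOURCE_DIR = "/data/photos"
DEST_DIR   = "/mnt/remote/photos"

def load_last_id():
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0

def backup(current_max_id):
    last = load_last_id()
    # only the ids added since the previous run are copied,
    # so the run time depends on the number of new photos, not the collection size
    for photo_id in range(last + 1, current_max_id + 1):
        src = os.path.join(SOURCE_DIR, f"{photo_id}.jpeg")
        if os.path.exists(src):
            shutil.copy2(src, DEST_DIR)
    with open(STATE_FILE, "w") as f:
        f.write(str(current_max_id))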
From Mikey:
Certainly every file is not linked directly in a web page?
Why not keep links in a database that point to the correct location on disk
for the images themselves? Then all you need to do is query the database
for a timestamp field that has changed and you know what you need to back
up, and at any time you can move the underlying files around and update the
links in the database...
You can be fairly sure that at least 80% of the photos are accessible on
the website.
I don't really understand you here, but I think we already have what you
mean.
But for completeness, this is roughly how the system works.
We already keep a record of the photos in the db for
bookkeeping/userinfo/accessrights/albums/etc...
The actual location of the image file is determined by the id plus a
secret, so image 11809373 can be accessed using
http://interval1.rendered.startpda.net/11800001-11850000/11809373_120_120_sfi5.jpeg.
This allows us to do resizing (120_120) and to serve the content with a
simple system of Apache servers and Squids.
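
For the curious, a small sketch of how such a URL is composed from the id
(the real secret derivation is deliberately not shown; make_secret below is
only a placeholder):

BUCKET = 50000   # ids are grouped in directories of 50000, e.g. 11800001-11850000

def make_secret(photo_id):
    # placeholder only: the real secret part (the "_sfi5" above) is derived
    # from the id plus a private key, which I obviously won't reproduce here
    return "XXXX"

def image_url(photo_id, width, height):
    lo = ((photo_id - 1) // BUCKET) * BUCKET + 1   # e.g. 11800001
    hi = lo + BUCKET - 1                           # e.g. 11850000
    return (f"http://interval1.rendered.startpda.net/"
            f"{lo}-{hi}/{photo_id}_{width}_{height}_{make_secret(photo_id)}.jpeg")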
jos
--
[email protected] mailing list