Thank you all for the replies; let me answer them in one reply.
Björn Gustafsson wrote:
Now what kind of systems are these? Home-grown arrays or "real" ones?
In the latter case, are there no vendor-provided approaches to this?
These are what you would call homegrown. We are now looking at more
serious systems, but they also cost serious money.
I'm not sure how this would apply to regular filesystems (no idea
which one you use though), but in "larger" (not size-wise) systems, a
bitmap of the filesystem is kept in a separate location, and
disk areas with changed or added files are marked as dirty and
transferred to the remote host either immediately (with synchronous
i/o), as soon as possible (async i/o), or when requested (veeeery
async i/o ;)). This is a rather effective system, with the backup speed
mainly dependent on the size you choose for the bitmap (large
bitmap => smaller blocks => potentially less data) and transfer
speed. Restructuring of data on the physical disk would also create a
major update of blocks to be transferred.
Hmm, I like that. Is it possible to filter the things that are
transferred? For example, I don't want to mirror deletions.
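
To check that I understand the idea, here is a rough sketch of such a
dirty-block bitmap (Python, purely illustrative; the block size and
function names are made up, and this is not how any particular
filesystem actually implements it):

BLOCK_SIZE = 64 * 1024   # smaller blocks => bigger bitmap => less data per sync

class DirtyBitmap:
    def __init__(self, device_size):
        nblocks = (device_size + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.dirty = bytearray(nblocks)   # one flag per block, kept separate from the data itself

    def mark_write(self, offset, length):
        # to be called for every write to the disk area
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for block in range(first, last + 1):
            self.dirty[block] = 1

    def dirty_blocks(self):
        return [b for b, flag in enumerate(self.dirty) if flag]

def sync_to_remote(bitmap, read_block, send_block):
    # the "when requested" (veeeery async) variant: walk the bitmap at backup time
    for block in bitmap.dirty_blocks():
        send_block(block, read_block(block))   # only changed areas go over the wire
        bitmap.dirty[block] = 0

I assume filtering out deletions would then be a decision made in the
transfer step rather than in the bitmap itself, but correct me if I'm wrong.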
Marking folders as dirty is another solution, however 50k files is a
bit big. Implementing dirty files in chunks of say 50 or 100 would be
a half-way solution, but that'd be dependent on the application [see
below].
Another nice suggestion: if the list of individually edited files gets
too big, we can indeed start working with groups.
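
Roughly what I picture for the groups (again just a sketch; the group
size and names are invented, and it assumes our files are numbered):

GROUP_SIZE = 100   # e.g. photos 11809300-11809399 form one group

dirty_groups = set()

def mark_file_dirty(photo_id):
    # the application only has to remember which group was touched
    dirty_groups.add(photo_id // GROUP_SIZE)

def ids_to_backup():
    # expand every dirty group back into the individual photo ids to re-copy
    for group in sorted(dirty_groups):
        start = group * GROUP_SIZE
        for photo_id in range(start, start + GROUP_SIZE):
            yield photo_id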
From Alex Efros:
In your case that means, for example: probably the best solution to the
backup issue is to change the way files are changed, so that changed files
aren't really CHANGED, but instead a new version is just ADDED to the collection.
This way it is enough for you to just remember which file was
backed up last by the previous backup and, on the next backup, continue from that
file (I suppose all your files are numbered: "(1-50000, 50001-100000, etc)").
This way the backup will not depend on the collection size (only on the amount of
added files) and will not depend on some "special feature" in the application
(like constructing a list of changed files) which may have bugs.
I really like this solution. It has several advantages:
- it's really simple
- it requires no interaction with the application.
- it adds a little overhead in disk space, but that's probably negligible
compared to the 20-30% of pictures we don't need but just don't delete. We
don't want to take the risk of deleting the wrong files, and of course, if
the service is ever misused, we still have evidence.
- no need to construct a list, just a last-backed-up-photo-id somewhere
(see the sketch below).
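
Something along these lines is really all the backup script would need
(a sketch only; the paths, the state-file location, and the one-file-per-id
layout are made up for illustration):

import os, shutil

STATE_FILE = "/var/backup/last_photo_id"   # hypothetical location
SOURCE_DIR = "/data/photos"
DEST_DIR   = "/mnt/remote/photos"

def load_last_id():
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0

def backup(current_max_id):
    last = load_last_id()
    # only the ids added since the previous run are copied,
    # so the run time depends on the number of new photos, not the collection size
    for photo_id in range(last + 1, current_max_id + 1):
        src = os.path.join(SOURCE_DIR, f"{photo_id}.jpeg")
        if os.path.exists(src):
            shutil.copy2(src, DEST_DIR)
    with open(STATE_FILE, "w") as f:
        f.write(str(current_max_id))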
From Mikey:
Certainly every file is not linked directly in a web page?
Why not keep links in a database that point to the correct location on disk
for the images themselves? Then all you need to do is query the database
for a timestamp field that has changed and you know what you need to back
up, and at any time you can move the underlying files around and update the
links in the database...
You can be fairly sure that at least 80% of the photos are accessible on
the website.
I don't really understand you here, but I think we already have what you
mean.
But for completeness, this is roughly how the system works.
We already keep a record of the photos in the db for
bookkeeping/userinfo/accessrights/albums/etc...
The actual location of the image file is determined by the id plus a
secret, so image 11809373 can be accessed using
http://interval1.rendered.startpda.net/11800001-11850000/11809373_120_120_sfi5.jpeg.
This allows us to do resizing (120_120) and to serve the content with a
simple system of Apache servers and Squids.
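
For the curious, a small sketch of how such a URL is composed from the id
(the real secret derivation is deliberately not shown; make_secret below is
only a placeholder):

BUCKET = 50000   # ids are grouped in directories of 50000, e.g. 11800001-11850000

def make_secret(photo_id):
    # placeholder only: the real secret part (the "_sfi5" above) is derived
    # from the id plus a private key, which I obviously won't reproduce here
    return "XXXX"

def image_url(photo_id, width, height):
    lo = ((photo_id - 1) // BUCKET) * BUCKET + 1   # e.g. 11800001
    hi = lo + BUCKET - 1                           # e.g. 11850000
    return (f"http://interval1.rendered.startpda.net/"
            f"{lo}-{hi}/{photo_id}_{width}_{height}_{make_secret(photo_id)}.jpeg")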
jos
--
[email protected] mailing list