Mark Stosberg wrote:
This question isn't so much a mod_perl question as it is a question
about building high-performance websites with Perl.
We have a large, busy database application that relates to millions of
photos, which we also need to store and display. We've been keeping the
metadata about the photos in PostgreSQL, and the files on the file
system. (Implemented much like CGI::Uploader.)
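The general flow is roughly this (a minimal sketch only; the table,
columns, and paths below are invented for illustration, and RETURNING
assumes PostgreSQL 8.2 or later):

    use strict;
    use warnings;
    use DBI;
    use File::Copy qw(copy);

    # Hypothetical schema: photos(id SERIAL PRIMARY KEY,
    #                             caption TEXT, path TEXT)
    my $dbh = DBI->connect('dbi:Pg:dbname=photos', 'user', 'secret',
                           { RaiseError => 1 });

    sub store_photo {
        my ($tmp_file, $caption) = @_;

        # 1. Insert the metadata row and get the generated id back.
        my ($id) = $dbh->selectrow_array(
            'INSERT INTO photos (caption) VALUES (?) RETURNING id',
            undef, $caption);

        # 2. Copy the uploaded file into place, named after the id.
        my $path = "/var/photos/$id.jpg";
        copy($tmp_file, $path) or die "copy failed: $!";

        # 3. Record where the file ended up.
        $dbh->do('UPDATE photos SET path = ? WHERE id = ?',
                 undef, $path, $id);
        return $id;
    }

Note the window between steps 1 and 2: if the copy fails after the row
is inserted, the two stores already disagree.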
This has worked great in terms of performance, but with so much data to
manage, over time we have run into data inconsistency issues between the
file system and the database.
So, I'm asking if anyone has had experience successfully storing photos
(or other files) directly in a database? That would solve the consistency
issue, but may create a performance issue. Perhaps the right kind of
caching layer could solve that.
I am curious about your application, because we have something similar,
at similar volumes, though with documents in general rather than photos.
The following is an opinion piece.
We have also looked at various data organisations over time.
Regarding storing large objects directly in a database, the one issue is
always that the object sizes make any operation on the rows of such a
table very heavy. Imagine having to dump or reload a table that contains
500,000 "blobs" of 2-3 MB each.
(I don't know about PostgreSQL, but many database systems require a dump
and a reload when you change a table's structure.)
Or simply taking a backup of that table. And you cannot "rsync" a
database as easily as you can a filesystem.
It also means that any intermediate buffers (which are often used to
speed up retrieval of "nearby" rows) fill up quickly while holding only
a few rows.
(On the other hand, if you just keep thumbnails of a couple of KB, I
guess it would not matter much.)
Another issue is that databases do get screwed up from time to time, and
the likelihood probably increases as you push them to their limits (for
example, with very large row sizes). Resolving inconsistencies between
database rows and files on disk may be no fun, but resolving
inconsistencies within a database may be even less so.
One point I am curious about is what kind of file structure you use to
store the millions of images on the filesystem. I can't imagine that
you really put them all into one flat directory?
And are you storing the real paths directly in the database?
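In case it's useful, here is one common scheme (a sketch only, with
invented paths; I'm not assuming this is what you do): derive the
directories from a hash of the id, so no single directory grows too
large.

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use File::Path qw(mkpath);

    # Map an id like 12345 to a path like /var/photos/a3/f9/12345.jpg,
    # so each directory stays small even with millions of files overall.
    sub photo_path {
        my ($id) = @_;
        my $hash = md5_hex($id);
        my ($d1, $d2) = (substr($hash, 0, 2), substr($hash, 2, 2));
        my $dir  = "/var/photos/$d1/$d2";
        mkpath($dir) unless -d $dir;
        return "$dir/$id.jpg";
    }

Two levels of two hex digits give 65,536 leaf directories, so a few
million photos works out to only a few dozen files per directory; and
since the path is derived from the id, the database only needs to store
the id.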
To get back to your issue of inconsistency: maybe the best strategy is
just to check for such inconsistencies as early as possible? For
example, when you add or delete objects, write that information to a
daily "transactions" file, which is then analysed at night by a job
that checks that everything is really where it is supposed to be, and
lets you know when it isn't.
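Such a job could be as simple as this (a sketch; the log format and
paths are invented):

    use strict;
    use warnings;

    # Each log line records what should have happened during the day:
    #   ADD /var/photos/a3/f9/12345.jpg
    #   DEL /var/photos/a3/f9/12344.jpg
    my $logfile = '/var/log/photos/today.txn';   # invented path
    open my $log, '<', $logfile
        or die "cannot open $logfile: $!";

    while (my $line = <$log>) {
        chomp $line;
        my ($op, $path) = split ' ', $line, 2;
        if ($op eq 'ADD' and not -e $path) {
            warn "MISSING: $path was added but is not on disk\n";
        }
        elsif ($op eq 'DEL' and -e $path) {
            warn "STALE: $path was deleted but is still on disk\n";
        }
    }
    close $log;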
Regarding efficiency: when you think about it, a modern filesystem is
about the fastest, most space-efficient and most reliable database
system one can think of, with the lowest overhead, as long as you
know the exact path of an object, and as long as all the directories in
the path are kept to a reasonable size (important). It has its inherent
buffering at various levels, optimised for file access. It has a whole
bunch of utilities to manipulate it; it is shareable, but can be locked
when you need it. It is portable.
It does have one inconvenience: it has a single "key" for accessing an
object (the path). But for that, you have your database system.
Oh, and I've thought of another advantage, in an Apache/web context: to
send the content of a file to a browser, you can take advantage of the
sendfile() call, which is very efficient. Now if your file is a blob in
a row of a database, you have to read it into memory yourself and send
it, don't you?
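Under mod_perl 2, for example, a response handler can hand the file
straight to Apache (a sketch; the handler name and hard-coded path are
invented, and sendfile support must be enabled in the server):

    package My::PhotoHandler;   # hypothetical handler name
    use strict;
    use warnings;
    use Apache2::RequestRec ();
    use Apache2::RequestIO ();  # provides $r->sendfile
    use Apache2::Const -compile => qw(OK NOT_FOUND);

    sub handler {
        my $r = shift;
        # In real life, look the path up in the database; hard-coded here.
        my $path = '/var/photos/a3/f9/12345.jpg';
        return Apache2::Const::NOT_FOUND unless -e $path;
        $r->content_type('image/jpeg');
        $r->sendfile($path);    # kernel sendfile(2) where available
        return Apache2::Const::OK;
    }
    1;

Configured as a PerlResponseHandler, the copy then happens in kernel
space where the platform supports it, without the file's contents ever
passing through your Perl code.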
I've just re-convinced myself not to change our storage structure.