Mark Stosberg wrote:
This question isn't so much a mod_perl question, as it is a question
about building high performance websites with Perl.
We have a large, busy, database application that relates to millions of
photos, which we also need to store and display. We've been keeping the
meta data about the photos in PostgreSQL, and the files on the file
system. (Implemented much like CGI::Uploader ).

This has worked great in terms of performance, but with so much data to
manage, over time we have run into data inconsistency issues between the
file system and the database.

So, I'm asking if anyone has had experience successfully storing photos
(or other files) directly in a database? That would solve the consistency
issue, but may create a performance issue. Perhaps the right kind of
caching layer could solve that.

I am curious about your application, because we have something similar, at similar volumes, though with documents in general rather than photos.

The following is an opinion piece.

We have also looked at various data organisations over time.

Regarding storing large objects directly in a database, one issue is always that, because of the object sizes, any operation on the rows of such a database or table becomes very heavy. Imagine having to dump or reload a table that contains 500,000 "blobs" of 2-3 MB each (I don't know about PostgreSQL, but many database systems require a dump and a reload when you change a table structure), or simply taking a backup of that table. And you cannot "rsync" a database as easily as a filesystem.

It also means that any intermediate buffer (often used to speed up retrieval of "nearby" rows) fills up quickly while holding only a few rows. (On the other hand, if you just keep thumbnails of a couple of KB, I guess it would not matter much.)

Another issue is that databases do get screwed up, and the likelihood probably increases as you push them to their limits (for example with very large row sizes). Resolving inconsistencies between database rows and files on disk may be no fun, but resolving inconsistencies within a database may be even less so.

One point I am curious about is what kind of file structure you use to store the millions of images on the filesystem. I can't imagine that you really put them all in one flat directory?
And are you storing the real paths directly in the database?
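The usual trick (not necessarily what the original poster does; this is just an illustration) is to derive a couple of nested bucket directories from a hash of the object's ID, so no single directory grows too large. A minimal sketch, with the root path, ID, and `.jpg` extension as assumptions:

```python
import hashlib
import os

def photo_path(root, photo_id):
    """Derive a nested path from a photo ID so that no single
    directory grows too large (256 x 256 buckets here)."""
    digest = hashlib.md5(str(photo_id).encode()).hexdigest()
    # The first two hex pairs pick the two bucket levels: b1/b2/<id>.jpg
    return os.path.join(root, digest[:2], digest[2:4], f"{photo_id}.jpg")

path = photo_path("/var/photos", 1234567)
# → /var/photos/<b1>/<b2>/1234567.jpg, where b1 and b2 are
#   two-hex-digit bucket names determined by the hash
```

Because the path is a pure function of the ID, you can store only the ID in the database and recompute the path on demand, which removes one source of drift between the two stores.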

To get back to your issue of inconsistency: maybe the best strategy is simply to check for such inconsistencies as early as possible. For example, when you add or delete objects, write that information to a daily "transactions" file, which a nightly job then analyses, checking that everything is really where it is supposed to be, and letting you know when it is not.
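The nightly check could be as simple as the following sketch. The log format (`ADD <path>` / `DEL <path>`, one entry per line) is an assumption made up for the example:

```python
import os

def check_transactions(log_lines):
    """Compare a day's ADD/DEL log entries against the filesystem.
    Returns a list of human-readable inconsistency reports.
    Assumed log format: 'ADD <path>' or 'DEL <path>' per line."""
    problems = []
    for line in log_lines:
        action, path = line.strip().split(None, 1)
        exists = os.path.exists(path)
        if action == "ADD" and not exists:
            problems.append(f"missing on disk: {path}")
        elif action == "DEL" and exists:
            problems.append(f"still on disk: {path}")
    return problems
```

A real version would also cross-check the database side (does a row exist for every ADD?), but the principle is the same: catch the drift within a day instead of discovering it months later.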

Regarding efficiency: when you think about it, a modern filesystem is about the fastest, most space-efficient, and most reliable database system one can think of, with the lowest overhead, as long as you know the exact path of an object, and as long as all the directories in the path are kept to a reasonable size (important). It has inherent buffering at various levels, optimised for accessing files. It has a whole bunch of utilities to manipulate it; it is shareable, but can be locked when you need it. It is portable. It does have one inconvenience: it has a single "key" to access an object (the path). But for that, you have your database system.

Oh, and I've thought of another advantage in an Apache/web context: to send the contents of a file to a browser, you can take advantage of the sendfile() call, which is very efficient. If your file is a blob in a database row, you have to read it into memory yourself and send it, don't you?
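To make the sendfile() point concrete, here is a sketch in Python (in mod_perl you would use Apache's own sendfile support; this just shows the system call). `socket.sendfile` uses the kernel's zero-copy sendfile(2) where available and falls back to an ordinary read/write loop elsewhere:

```python
def serve_file(conn, path):
    """Send a file over a connected socket.  socket.sendfile() uses the
    kernel's sendfile(2) when available, so the file's pages go from the
    page cache to the socket without a round trip through userspace."""
    with open(path, "rb") as f:
        return conn.sendfile(f)  # returns the number of bytes sent
```

With a blob in a database row, by contrast, the bytes must travel database server → client library → your process's memory → socket, which is exactly the copying sendfile() avoids.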

I've just re-convinced myself not to change our storage structure.
