Mark Smith wrote:
>  I've been playing with MogileFS for quite some time in testing, but
>  I'm getting ready to actually do something with it now and have been
>  searching for some "best practices" regarding serving the stored
>  mogileFS files.  I've come up empty searching for this information
>  aside from some mailing list posts, so I figure that I just need to
>  get some input, write up my assumptions and then I'd like to post to
>  the wiki.  If this is good enough, who do I ask for an account to
>  update the wiki (seems I need an invite key)?  After it is up there, I
>  hope people can just make incremental revisions to get it into better
>  shape.

I have no idea about the MogileFS wiki... we recently put something for Perlbal on Google Code, might be useful to do the same thing for MogileFS... Brad?

>  To begin with, it is recommended to write an "interpretation" layer
>  that translates the end-user URL to an internal mogileFS layer.  For
>  this example, a url like
>  "http://static.myapp.com/users/1234567-thumb.jpg" would then be
>  translated to a mogileFS stored key, along with the class.  To keep
>  this simple, the key could be "1234567:thumb" and the class would be
>  "users".

Yes, typically this is in your backend webservers.  See below.
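
As a rough illustration of that translation step, here is a minimal Python sketch. It assumes the URL layout from the quoted example (/<class>/<id>-<variant>.<ext>); the function name url_path_to_key and the regular expression are made up for this example and are not part of MogileFS.

```python
import re
from typing import Optional, Tuple

# Assumed layout, taken from the quoted example: /<class>/<id>-<variant>.<ext>
_PATH_RE = re.compile(
    r"^/(?P<cls>[a-z]+)/(?P<id>\d+)-(?P<variant>[a-z]+)\.[a-z0-9]+$"
)


def url_path_to_key(path: str) -> Optional[Tuple[str, str]]:
    """Translate a public URL path into a (key, class) pair.

    Returns None when the path does not match the assumed layout, so the
    caller can answer with a 404 instead of asking the tracker.
    """
    match = _PATH_RE.match(path)
    if match is None:
        return None
    key = f"{match['id']}:{match['variant']}"
    return key, match["cls"]
```

With that, url_path_to_key("/users/1234567-thumb.jpg") gives ("1234567:thumb", "users"), which is the key and class from the quoted example.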

>  The next step is to write some code that takes the key and class,
>  fetches the paths from a tracker and then, in some fashion, proxies
>  the image response.

Perlbal is the recommended way of doing this, actually, as it does most of the work for you.

>  There are two distinct points where caching can be utilized to speed
>  up the serving of files.  Up front, you can cache the entire URL (so
>  that if http://static.myapp.com/users/1234567-thumb.jpg has been seen,
>  it returns the last response, just like any other image proxy would
>  work).  Secondarily, cache the resulting URLs returned from the
>  tracker.  This is where memcached plays in nicely together with
>  mogileFS.

Perlbal does the second kind of caching itself. The first kind can be put up front with squid or another cache, and that has been done with success.

>  I'd also like to provide some sample code for fetching the paths and
>  reproxying, but I'm uncertain of the best way to do this and I'm
>  hoping others can help.  If you just forward the request to the
>  internal tracker URL, I don't quite see how to set the mime-type and
>  other headers (perhaps this is something easy in perlbal?  I have to
>  admit my ignorance on this.)

Actually this is something done at the backend. Perlbal doesn't know what kind of file it is, and neither does MogileFS. You're expected to have some sort of logic in your application that sends back the proper headers for the file you want to reproxy.

>  The other point that I'd like to talk about is an image manipulation
>  layer.  Something like how most image hosting services do that checks
>  the incoming referrer header.  If the referrer is blank, or the host
>  is not equal to "myapp.com", then affix a "stamp" on to the top of the
>  image.  Has there been any thoughts for a mogileFS Cookbook setup?  I
>  think it would greatly help out newcomers to the product.

Easily done in the traditional setup.
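
The referrer check itself is only a few lines in the backend. A minimal Python sketch, assuming your application hands you the raw Referer header value; the name needs_stamp is made up, and accepting subdomains of myapp.com is my assumption on top of the "host equals myapp.com" rule quoted above.

```python
from typing import Optional
from urllib.parse import urlsplit


def needs_stamp(referer: Optional[str], own_domain: str = "myapp.com") -> bool:
    """Decide whether the image should get a stamp before it is served.

    A blank or missing Referer, or one from a foreign host, means "stamp it".
    Subdomains of own_domain are treated as our own (an assumption).
    """
    if not referer:
        return True
    host = urlsplit(referer).hostname or ""
    is_ours = host == own_domain or host.endswith("." + own_domain)
    return not is_ours
```

When this returns True the backend would have to serve a stamped copy (generated on the fly or stored under its own key) rather than reproxying the original.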

So, enough about that - this is traditionally how MogileFS is used. There are other ways of using it, but it's sort of designed around the idea of using Perlbal, so many of the pieces fit more neatly if you do.

                                 /->  [ webserver ] -> [ mogilefsd ]
                                /                           |
{ internet } -> [ perlbal ] -> <                            /
                                \                          /
                                 \->  [ mogstored ] <-----/

I don't know if that will translate well.  There are graphics somewhere...

Anyway. So Perlbal answers all incoming requests. It talks to your webservers and your MogileFS storage nodes (mogstored/lighttpd/whatever you use as the storage server component). Your webserver talks to the MogileFS trackers (mogilefsd instances). The trackers talk to the storage nodes.

Let's walk through the request, something like:

------
1) GET /some/path/somefile-123.jpg arrives at Perlbal.

In the very simple case, let's ignore caching. At this point Perlbal will play this request out to a backend - acting as a reverse proxy in this case. The request gets sent to a backend webserver.

2) GET /some/path/somefile-123.jpg arrives at backend webserver.

This is your custom application. Again, I'm going to simplify slightly. You get a request for that URL, you know that it's in MogileFS. So you translate it into the proper path and send the request to the tracker.

3) get_paths somefile:123 arrives at MogileFS tracker

The tracker does its own magic to determine where the file is. It will return some internal path.

4) Webserver now knows that the file is at http://10.0.0.34:8034/dev1/0/0/0/0.fid (plus another URL or two).

Your application now constructs the headers needed. Presumably you either store some metadata locally (the LiveJournal approach: the application keeps a table containing things like file format, size, etc.) or you infer it from the URL the user gave you.

Either way, I think you only need to return the Content-Type header. There might be one or two more... Perlbal will handle Content-Length and the like.

So you return the response to Perlbal, with an X-REPROXY-URL header containing the list of internal URLs.

5) Perlbal gets back the response from your webserver.

Perlbal now retrieves one of the reproxy URLs internally and starts reproxying it. The headers returned to the user are some combination of the response from your webserver and the response from the storage node. (Sorry for not being very descriptive here... I don't remember exactly which headers are pulled from where.)

6) User gets their file.
------
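
To make step 4 a bit more concrete, here is a minimal Python sketch of the backend's side of the walkthrough. It only builds the response headers; fetching the paths from the tracker is left to whatever MogileFS client you use. The function name build_reproxy_headers is made up, the content type is guessed from the request URL (the second option mentioned in step 4), and joining the URLs with spaces is how I remember Perlbal expecting the list, so check that against the Perlbal docs.

```python
import mimetypes
from typing import List, Sequence, Tuple


def build_reproxy_headers(
    request_path: str, storage_urls: Sequence[str]
) -> List[Tuple[str, str]]:
    """Build the headers the backend sends to Perlbal for a reproxied file.

    storage_urls is the list of internal URLs the tracker returned for the
    key. Content-Length is left out on purpose; Perlbal takes care of it.
    """
    if not storage_urls:
        raise ValueError("tracker returned no paths for this key")
    # Guess the type from the public URL's extension (an assumption; a
    # metadata table in your application would work just as well).
    content_type, _encoding = mimetypes.guess_type(request_path)
    return [
        ("Content-Type", content_type or "application/octet-stream"),
        # Assumed format: space-separated list of internal URLs.
        ("X-REPROXY-URL", " ".join(storage_urls)),
    ]
```

For the request in the walkthrough, build_reproxy_headers("/some/path/somefile-123.jpg", ["http://10.0.0.34:8034/dev1/0/0/0/0.fid"]) yields a Content-Type of image/jpeg plus the X-REPROXY-URL header. The response body can be empty, since Perlbal replaces it with the file.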

Now, there are some obvious spots for caching. Perlbal can do caching based on the URL, skipping the entire hit to your webserver. This works great for resources that are public and don't need access control. If you do need access control, you can cache at your own application layer, and skip hitting MogileFS.
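
For the application-layer case, the idea is just to remember the tracker's answer for a key for a short while. A minimal Python sketch, with an in-process dict standing in for memcached; the class name PathCache, the fetch_paths callback and the 60 second TTL are all made up for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class PathCache:
    """Remember tracker lookups for a short time to skip repeat requests."""

    def __init__(
        self,
        fetch_paths: Callable[[str], List[str]],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_paths  # hypothetical: wraps your tracker client
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, key: str) -> List[str]:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, no tracker round trip
        paths = self._fetch(key)
        self._entries[key] = (now + self._ttl, paths)
        return paths
```

Your access-control check still runs on every request; only the tracker lookup is skipped. Keep the TTL short, since files can move between storage nodes and a stale path will just fail.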

You can also cache the file itself out front in something like squid; that works too. The same access control questions apply, but you can get inventive about that if you want. (.htaccess with a password on user directories or something?)

Anyway, this is somewhat rambling. I and others are happy to answer questions, just fire away. If you really want to work on documentation, I'd be happy to buy you a beer. It's something sorely sorely lacking.


--
Mark Smith / xb95
[EMAIL PROTECTED]
Awesome guide!
Where was this guide 6 months ago? :(

Put that online somewhere ... would have saved me quite some time searching and guessing.

about the walkthrough point 4)
would love to have the content-type of the file stored by mogilefsd.
otherwise i have to check the content-type of that file myself or encode the information about the content-type into the domain/class or the request url.

for example using:
domain: image
class: jpeg
-> content-type: image/jpeg


or url: /i/j/userphoto_1234
-> content-type: image/jpeg

but as mogilefs has to touch that file anyway it would be a good place to store the content-type (as it stores the size of the file)
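
until something like that exists, the url variant could be sketched roughly like this in Python. the one-letter codes are taken from the /i/j/ example above; the extra table entries and the name content_type_from_url are made up for illustration.

```python
from typing import Optional

# "i" and "j" come from the example url; the other codes are hypothetical.
TYPE_CODES = {"i": "image", "v": "video"}
SUBTYPE_CODES = {"j": "jpeg", "p": "png"}


def content_type_from_url(path: str) -> Optional[str]:
    """Derive the content-type from a url shaped like /<type>/<subtype>/<name>."""
    parts = path.strip("/").split("/")
    if len(parts) < 3:
        return None
    main = TYPE_CODES.get(parts[0])
    sub = SUBTYPE_CODES.get(parts[1])
    if main is None or sub is None:
        return None
    return f"{main}/{sub}"
```

content_type_from_url("/i/j/userphoto_1234") gives image/jpeg. the domain/class variant is the same idea with the domain and class names as the lookup keys.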
