Re: Input on best practices for serving files

sanados Wed, 09 Apr 2008 22:38:44 -0700

Mark Smith wrote:

>  I've been playing with MogileFS for quite some time in testing, but
>  I'm getting ready to actually do something with it now and have been
>  searching for some "best practices" regarding serving the stored
>  mogileFS files.  I've come up empty searching for this information
>  aside from some mailing list posts, so I figure that I just need to
>  get some input, write up my assumptions and then I'd like to post to
>  the wiki.  If this is good enough, who do I ask for an account to
>  update the wiki (seems I need an invite key)?  After it is up there, I
>  hope people can just make incremental revisions to get it into better
>  shape.
I have no idea about the MogileFS wiki... we recently put something for Perlbal on Google Code, might be useful to do the same thing for MogileFS... Brad?
>  To begin with, it is recommended to write an "interpretation" layer
>  that translates the end-user URL to an internal mogileFS layer.  For
>  this example, a url like
>  "http://static.myapp.com/users/1234567-thumb.jpg" would then be
>  translated to a mogileFS stored key, along with the class.  To keep
>  this simple, the key could be "1234567:thumb" and the class would be
>  "users".

Yes, typically this is in your backend webservers.  See below.
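
A rough Python sketch of that translation step, for illustration only. The URL layout and the "id:variant" key format come from the example quoted above; the function name url_to_mogile and the regular expression are made up here, and a real application would use whatever naming scheme it already has.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for paths such as /users/1234567-thumb.jpg
_PATH_RE = re.compile(
    r"^/(?P<cls>[a-z]+)/(?P<id>\d+)-(?P<variant>[a-z]+)\.[a-z0-9]+$"
)


def url_to_mogile(url):
    """Return (key, class) for a public URL, or None if it does not match."""
    match = _PATH_RE.match(urlparse(url).path)
    if match is None:
        return None
    key = "{}:{}".format(match.group("id"), match.group("variant"))
    return key, match.group("cls")
```

With the URL from the example, this yields the key "1234567:thumb" and the class "users".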

>  The next step is to write some code that takes the key and class,
>  fetches the paths from a tracker and then, in some fashion, proxies
>  the image response.
Perlbal is the recommended way of doing this, actually, as it does most of the work for you.
>  There are two distinct points where caching can be utilized to speed
>  up the serving of files.  Up front, you can cache the entire URL (so
>  that if http://static.myapp.com/users/1234567-thumb.jpg has been seen,
>  it returns the last response, just like any other image proxy would
>  work).  Secondarily, cache the resulting URLs returned from the
>  tracker.  This is where memcached plays in nicely together with
>  mogileFS.
Perlbal does the second kind of caching in itself, and the first kind can be put up front with squid/other types of caches and has been done with success.
>  I'd also like to provide some sample code for fetching the paths and
>  reproxying, but I'm uncertain of the best way to do this and I'm
>  hoping others can help.  If you just forward the request to the
>  internal tracker URL, I don't quite see how to set the mime-type and
>  other headers (perhaps this is something easy in perlbal?  I have to
>  admit my ignorance on this.)
Actually this is something done at the backend. Perlbal doesn't know what kind of file it is, neither does MogileFS. You're expected to have some sort of logic in your application that sends back the proper headers for the file you want to reproxy.
>  The other point that I'd like to talk about is an image manipulation
>  layer.  Something like how most image hosting services do that checks
>  the incoming referrer header.  If the referrer is blank, or the host
>  is not equal to "myapp.com", then affix a "stamp" on to the top of the
>  image.  Has there been any thoughts for a mogileFS Cookbook setup?  I
>  think it would greatly help out newcomers to the product.

Easily done in the traditional setup.
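
As a sketch of what that check might look like in the backend application (Python, illustration only): the function name needs_stamp is made up, and treating subdomains of myapp.com as "own" hosts is an assumption that goes slightly beyond the strict equality described above. The actual image stamping is left out.

```python
from urllib.parse import urlparse


def needs_stamp(referer, own_domain="myapp.com"):
    """True when the Referer is missing or points at a foreign host."""
    if not referer:
        return True
    host = (urlparse(referer).hostname or "").lower()
    # Assumption: subdomains such as www.myapp.com count as our own site.
    return not (host == own_domain or host.endswith("." + own_domain))
```

The application would call this with the incoming Referer header and, when it returns True, serve (or generate) the stamped variant instead of the plain one.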
So, enough about that - this is traditionally how MogileFS is used. There are other ways of using it, but it's sort of designed around the idea of using Perlbal, so many of the pieces fit more neatly if you do.
                                 /->  [ webserver ] -> [ mogilefsd ]
                                /                           |
{ internet } -> [ perlbal ] -> <                            /
                                \                          /
                                 \->  [ mogstored ] <-----/

I don't know if that will translate well.  There are graphics somewhere...
Anyway. So Perlbal answers all incoming requests. It talks to your webservers and your MogileFS storage nodes (mogstored/lighttpd/whatever you use as the storage server component). Your webserver talks to the MogileFS trackers (mogilefsd instances). The trackers talk to the storage nodes.
Let's walk through the request, something like:

------
1) GET /some/path/somefile-123.jpg arrives at Perlbal.
In the very simple case, let's ignore caching. At this point Perlbal will play this request out to a backend - acting as a reverse proxy in this case. The request gets sent to a backend webserver.
2) GET /some/path/somefile-123.jpg arrives at backend webserver.
This is your custom application. Again, I'm going to simplify slightly. You get a request for that URL, you know that it's in MogileFS. So you translate it into the proper path and send the request to the tracker.
3) get_paths somefile:123 arrives at MogileFS tracker
The tracker does its own magic to determine where the file is. It will return some internal path.
4) Webserver now knows that the file is at http://10.0.0.34:8034/dev1/0/0/0/0.fid (plus another URL or two).
Your application now constructs the headers needed - presumably you either store some metadata locally (the LiveJournal approach, the application stores a table containing things like - file format, size, etc) or you assume it from the URL the user gave you.
Either way, you only need to return the Content-Type header? There might be one or two more... Perlbal will handle Content-Length and the like.
So you return the response to Perlbal, with X-REPROXY-URL: internal URL list.
5) Perlbal gets back the response from your webserver.
This URL is now retrieved internally and Perlbal starts reproxying it. The headers returned to the user are some combination of the response from your webserver and the response from the storage node. (Sorry for not being very descriptive here... I don't remember exactly which headers are pulled from where.)
6) User gets their file.
------
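
Steps 2 through 4 could look roughly like the following Python WSGI sketch. Treat it as an illustration under stated assumptions, not a reference implementation: lookup_paths is a placeholder for a real tracker client (it just returns the example URL from step 4), the "somefile-123.jpg" to "somefile:123" key scheme is guessed from step 3, the Content-Type is assumed from the file extension, and the space-separated form of the X-REPROXY-URL value should be checked against the Perlbal documentation.

```python
import mimetypes


def lookup_paths(key):
    """Placeholder for the tracker get_paths call; swap in a real client."""
    # Example value copied from step 4; a tracker may return several URLs.
    return ["http://10.0.0.34:8034/dev1/0/0/0/0.fid"]


def path_to_key(path):
    """'/some/path/somefile-123.jpg' -> 'somefile:123' (assumed key scheme)."""
    name = path.rsplit("/", 1)[-1]
    stem = name.rpartition(".")[0]
    return stem.replace("-", ":", 1)


def application(environ, start_response):
    """WSGI backend: answer with X-REPROXY-URL instead of the file body."""
    path = environ.get("PATH_INFO", "")
    key = path_to_key(path)
    paths = lookup_paths(key) if key else []
    if not paths:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    # Assume the type from the URL, since MogileFS does not store it.
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    start_response("200 OK", [
        ("Content-Type", content_type),
        ("X-REPROXY-URL", " ".join(paths)),
    ])
    return [b""]
```

The body is left empty on purpose: in this setup the backend only supplies headers, and Perlbal fetches the file from the storage node itself.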
Now, there are some obvious spots for caching. Perlbal can do caching based on the URL, skipping the entire hit to your webserver. This works great for resources that are public and don't need access control. If you do need access control, you can cache at your own application layer, and skip hitting MogileFS.
You can also cache the file itself out front in something like squid; that works too. Same sort of access control questions though, but you can get inventive about that if you want. (.htaccess with a password on user directories or something?)
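
For the application-layer caching mentioned above, here is a minimal Python sketch of caching tracker results by key. It keeps entries in a plain in-process dict with a time-to-live, standing in for memcached; the class name PathCache, the 60 second default, and the fetch callable are all assumptions for illustration.

```python
import time


class PathCache:
    """Tiny in-process TTL cache; a stand-in for memcached in this sketch."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch    # callable: key -> list of internal URLs
        self._ttl = ttl        # seconds an entry stays valid
        self._clock = clock    # injectable so the expiry can be tested
        self._entries = {}     # key -> (expires_at, paths)

    def get_paths(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]            # still fresh: skip the tracker
        paths = self._fetch(key)       # miss or expired: ask the tracker
        self._entries[key] = (now + self._ttl, paths)
        return paths
```

A short TTL seems the safer choice here, since the storage paths for a file can change when MogileFS replicates or moves it.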
Anyway, this is somewhat rambling. I and others are happy to answer questions, just fire away. If you really want to work on documentation, I'd be happy to buy you a beer. It's something sorely sorely lacking.
--
Mark Smith / xb95
[EMAIL PROTECTED]

Awesome guide!
Where was this guide 6 months ago? :(

Put that online somewhere ... would have saved me quite some time searching and guessing.


about the walkthrough point 4)
would love to have the content-type of the file stored by mogilefsd.

otherwise i have to check the content-type of that file myself or encode the information about the content-type into the domain/class or the request url.


for example using:
domain: image
class: jpeg
-> content-type: image/jpeg


or url: /i/j/userphoto_1234
-> content-type: image/jpeg

but as mogilefs has to touch that file anyway it would be a good place to store the content-type (as it stores the size of the file)
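
until then, the domain/class workaround could be a simple lookup table in the application. a small Python sketch, where the table contents and the name content_type_for are just examples and not anything MogileFS provides:

```python
# Hypothetical mapping from (domain, class) to a Content-Type.
CLASS_CONTENT_TYPES = {
    ("image", "jpeg"): "image/jpeg",
    ("image", "png"): "image/png",
    ("image", "gif"): "image/gif",
}


def content_type_for(domain, cls, default="application/octet-stream"):
    """Look up the Content-Type implied by a domain/class pair."""
    return CLASS_CONTENT_TYPES.get((domain, cls), default)
```

anything not in the table falls back to a generic binary type, so an unknown class still gets served, just without a useful content-type.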
