We at Veoh face a similar problem of serving files from a distributed store,
probably on a somewhat larger scale than you are looking at.

We evaluated several systems and determined that, for our needs, Hadoop was
not a good choice.  The advantages of Hadoop (map-reduce, blocking) were
substantially outweighed by issues with scale (we have nearly a billion
files that we need to work with) and hackability (our system admins need to
be able to modify placement policies).  We were also unable to find anybody
else using Hadoop in this way.

Commercial distributed stores were eliminated based on cost alone.

In the end, we chose to use mogile for origin file service and have been
pretty happy with the choice, although we did wind up making significant
changes to the system in order to facilitate scaling.  One major point in
favor of mogile was that we knew it was already in use at some high-volume
sites.


On 3/22/08 10:32 AM, "Robert Krüger" <[EMAIL PROTECTED]> wrote:

> 
> Hi,
> 
> we're currently evaluating the use of Hadoop's HDFS for a project where
> most of the data will be large files and latency will not matter that
> much, so it should be suited perfectly in those cases, but some of the
> data will be smaller files (a few K) and will be accessed quite
> frequently (e.g. by an HTTP server serving user requests, however not
> with a very high load). Now, while the docs state that throughput rather
> than latency was optimized, I am curious as to what order of magnitude
> of an overhead is to be expected for, e.g. reading a file's metadata
> such as modification timestamp or streaming (reading) a small file as a
> whole as compared to things like NFS access or access to these resources
> via another HTTP server using the same network hardware and setup?
> Paying a 10% penalty in latency compared to NFS would probably be OK but
> at the moment I have no idea if we are talking about a 1, 10 or 100%
> difference.
> 
> Has anyone done similar things, i.e. serve not so large files directly
> from an HDFS cluster via an HTTP server?
> 
> Thanks in advance,
> 
> Robert
> 
