Thank you very much for sharing your experience! This is very helpful and we will take a look at mogile.

I have two questions regarding your decision against HDFS. You mention issues with scale regarding the number of files. Could you elaborate a bit? At what order of magnitude would you expect problems with HDFS, and what kind of problems? Our system will have at most a few million files, so that's at least two orders of magnitude fewer than in your case.

Did you look at the latency of file access at all? Did you run into any issues there?

Thanks,

Robert


Ted Dunning wrote:
We at Veoh face a similar problem of serving files from a distributed store,
probably on a somewhat larger scale than you are looking at.

We evaluated several systems and determined that for our needs, hadoop was
not a good choice.  The advantages of hadoop (map-reduce, blocking) were
substantially outweighed by issues with scale (we have nearly a billion
files that we need to work with) and hackability (we need to have system
admins be able to modify placement policies).  We also were unable to find
anybody else using hadoop in this way.

Commercial distributed stores were eliminated based on cost alone.

In the end, we chose to use mogile for origin file service and have been
pretty happy with the choice, although we did wind up making significant
changes to the system in order to facilitate scaling.  One major advantage
of mogile was that we knew it was already being used on some high-volume
sites.


On 3/22/08 10:32 AM, "Robert Krüger" <[EMAIL PROTECTED]> wrote:

Hi,

we're currently evaluating the use of Hadoop's HDFS for a project where
most of the data will be large files and latency will not matter much,
so HDFS should suit those cases perfectly. However, some of the data
will be smaller files (a few KB) that will be accessed quite frequently
(e.g. by an HTTP server serving user requests, though not under very
high load). The docs state that HDFS is optimized for throughput rather
than latency, so I am curious what order of magnitude of overhead to
expect for, e.g., reading a file's metadata (such as its modification
timestamp) or streaming (reading) a small file as a whole, compared to
NFS access or access to these resources via another HTTP server using
the same network hardware and setup. Paying a 10% latency penalty
compared to NFS would probably be OK, but at the moment I have no idea
whether we are talking about a 1%, 10%, or 100% difference.
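
For reference, here is a minimal sketch of the two operations I mean,
using the standard org.apache.hadoop.fs API (the namenode address and
the file path are made up, and the timing is just System.nanoTime
around each call):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SmallFileLatency {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // hypothetical namenode address
          conf.set("fs.default.name", "hdfs://namenode:9000");
          FileSystem fs = FileSystem.get(conf);
          Path p = new Path("/data/small/file.bin"); // hypothetical path

          // 1. metadata read (a round trip to the namenode)
          long t0 = System.nanoTime();
          FileStatus status = fs.getFileStatus(p);
          long metaNanos = System.nanoTime() - t0;
          System.out.println("mtime: " + status.getModificationTime()
                  + ", metadata read took " + metaNanos / 1000 + " us");

          // 2. stream the whole (small) file
          t0 = System.nanoTime();
          FSDataInputStream in = fs.open(p);
          byte[] buf = new byte[(int) status.getLen()];
          in.readFully(0, buf);
          in.close();
          long readNanos = System.nanoTime() - t0;
          System.out.println("read " + buf.length + " bytes in "
                  + readNanos / 1000 + " us");

          fs.close();
      }
  }

Running the same two steps against an NFS mount (java.io.File's
lastModified() plus a plain FileInputStream) would give the baseline
we want to compare against.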

Has anyone done similar things, i.e. served not-so-large files directly
from an HDFS cluster via an HTTP server?

Thanks in advance,

Robert




--
(-) Robert Krüger
(-) SIGNAL 7 Gesellschaft für Informationstechnologie mbH
(-) Landwehrstraße 4 - 64293 Darmstadt,
(-) Tel: +49 (0) 6151 969 96 11, Fax: +49 (0) 6151 969 96 29
(-) [EMAIL PROTECTED], www.signal7.de
(-) Amtsgericht Darmstadt, HRB 6833
(-) Geschäftsführer: Robert Krüger, Frank Peters, Jochen Strunk
