A few million files should fit pretty easily in HDFS. One problem is that Hadoop is not designed with full high availability in mind (the single NameNode is a single point of failure). Mogile is easier to adapt to those needs.
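For a rough sense of why a few million files are comfortable where a billion are not: the NameNode holds the entire namespace in RAM on one machine, so the binding constraint is file count, not total bytes. Here is a back-of-envelope sketch in Java, assuming the commonly cited ~150 bytes per namespace object and two objects (inode + block) per small file; both figures are rules of thumb, not measurements from this thread:

    // Back-of-envelope NameNode heap estimate. The ~150 bytes/object and
    // "two objects per small file" (one inode + one block) figures are
    // assumptions, not measured values.
    public class NamespaceEstimate {
        public static void main(String[] args) {
            final long BYTES_PER_OBJECT = 150;   // assumed per-object heap cost
            final long OBJECTS_PER_FILE = 2;     // inode + a single block

            long[] fileCounts = { 5_000_000L, 1_000_000_000L };
            for (long files : fileCounts) {
                double gb = files * OBJECTS_PER_FILE * BYTES_PER_OBJECT / 1e9;
                System.out.printf("%,d files -> roughly %.1f GB of NameNode heap%n",
                                  files, gb);
            }
        }
    }

By that estimate a few million files cost on the order of a gigabyte of heap, while a billion files would need hundreds of gigabytes, which is why file count alone ruled HDFS out for us.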
Latency in our case is much less critical than for many applications, since we are talking about a web service and are only the origin point for a content delivery network (or three).

On 3/24/08 2:44 AM, "Robert Krüger" <[EMAIL PROTECTED]> wrote:

>
> Thank you very much for sharing your experience! This is very helpful
> and we will take a look at Mogile.
>
> I have two questions regarding your decision against HDFS. You mention
> issues with scale regarding the number of files. Could you elaborate a
> bit? At which orders of magnitude would you expect problems with HDFS,
> and which problems? Our system will have at most a few million files, so
> that's at least two orders of magnitude less than in your case.
>
> Did you look at latency of file access at all? Did you run into any
> issues there?
>
> Thanks,
>
> Robert
>
>
> Ted Dunning wrote:
>> We at Veoh face a similar problem of serving files from a distributed
>> store, probably on a somewhat larger scale than you are looking at.
>>
>> We evaluated several systems and determined that for our needs, Hadoop
>> was not a good choice. The advantages of Hadoop (map-reduce, blocking)
>> were substantially outweighed by issues with scale (we have nearly a
>> billion files that we need to work with) and hackability (we need to
>> have system admins be able to modify placement policies). We also were
>> unable to find anybody else using Hadoop in this way.
>>
>> Commercial distributed stores were eliminated based on cost alone.
>>
>> In the end, we chose to use Mogile for origin file service and have
>> been pretty happy with the choice, although we did wind up making
>> significant changes to the system in order to facilitate scaling. One
>> major advantage of choosing Mogile was the fact that we knew it was
>> already being used on some high-volume sites.
>>
>>
>> On 3/22/08 10:32 AM, "Robert Krüger" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> we're currently evaluating the use of Hadoop's HDFS for a project
>>> where most of the data will be large files and latency will not
>>> matter that much, so it should be suited perfectly in those cases.
>>> However, some of the data will be smaller files (a few K) that will
>>> be accessed quite frequently (e.g. by an HTTP server serving user
>>> requests, though not with a very high load). Now, while the docs
>>> state that throughput rather than latency was optimized, I am curious
>>> what order of magnitude of overhead is to be expected for, e.g.,
>>> reading a file's metadata such as the modification timestamp, or
>>> streaming (reading) a small file as a whole, compared to NFS access
>>> or access to these resources via another HTTP server using the same
>>> network hardware and setup. Paying a 10% penalty in latency compared
>>> to NFS would probably be OK, but at the moment I have no idea whether
>>> we are talking about a 1%, 10%, or 100% difference.
>>>
>>> Has anyone done similar things, i.e. served not-so-large files
>>> directly from an HDFS cluster via an HTTP server?
>>>
>>> Thanks in advance,
>>>
>>> Robert
>>>
>>
>
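For the latency question, the most direct answer would come from timing it on your own hardware. A minimal harness against the Hadoop FileSystem API might look like the sketch below; the path and iteration count are made-up placeholders, it assumes fs.default.name (or an explicit URI) points at your cluster, and you would run the equivalent java.io loop against an NFS mount of the same files for comparison:

    // Hypothetical timing harness: HDFS metadata reads vs. full reads of
    // one small file. Path and iteration count are made-up examples.
    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallFileLatency {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/test/small-file.bin");  // hypothetical path
            int iters = 1000;
            byte[] buf = new byte[8192];

            long t0 = System.nanoTime();
            for (int i = 0; i < iters; i++) {
                fs.getFileStatus(p).getModificationTime();  // metadata-only round trip
            }
            long t1 = System.nanoTime();

            for (int i = 0; i < iters; i++) {
                InputStream in = fs.open(p);                // open + stream whole file
                while (in.read(buf) != -1) { /* drain */ }
                in.close();
            }
            long t2 = System.nanoTime();

            System.out.printf("getFileStatus: %.2f ms/op%n", (t1 - t0) / 1e6 / iters);
            System.out.printf("full read:     %.2f ms/op%n", (t2 - t1) / 1e6 / iters);
        }
    }

A metadata call like getFileStatus() is a single round trip to the NameNode, while open-and-read adds a DataNode connection per file, so expect the two numbers to differ; measuring both against NFS on the same network is the only way to know whether the gap is 1%, 10%, or 100%.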