Hi Brian, I'm sorry, but it is not my paper. :-) I posted the link here because we're always looking for comparison data, so I thought this benchmark would be welcome.
Also, I won't attend the conference. However, it would be a good idea for someone who will to ask the authors all these questions directly and then post their answers here.

On Tue, Apr 14, 2009 at 2:26 PM, Brian Bockelman <[email protected]> wrote:

> Hey Guilherme,
>
> It's good to see comparisons, especially as it helps folks understand better what tool is best for their problem. As you show in your paper, a MapReduce system is hideously bad at performing tasks that column-store databases were designed for (selecting a single value along an index, joining tables).
>
> Some comments:
>
> 1) For some of your graphs, you show Hadoop's numbers in half-grey, half-white. I can't figure out for the life of me what this signifies! What have I overlooked?
>
> 2) I see that one of your co-authors is the CEO/inventor of the Vertica DB. Out of curiosity, how did you interact with Vertica versus Hadoop versus DBMS-X? Did you get help tuning the systems from the experts? I.e., if you sat down with a Hadoop expert for a few days, I'm certain you could squeeze out more performance, just like whenever I sit down with an Oracle DBA for a few hours, my DB queries are much faster. You touch upon the sociological issues (having to program your own code versus having to know only SQL, as well as the comparative time it took to set up the DB) -- I'd like to hear how much time you spent "tweaking" and learning the best practices for the three very different approaches. If you added a fifth test, what's the marginal effort required?
>
> 3) It would be nice to see how some of your more DB-like tasks perform on something like HBase. That'd be a much more apples-to-apples comparison of a column-store DBMS versus a column-store data system, although the HBase work is just now revving up. I'm a bit uninformed in that area, so I don't have a good gut feeling for how that'd do.
>
> 4) I think the UDF aggregation task (calculating the inlink count for each document in a sample) is interesting -- it's a more MapReduce-oriented task, and it sounds like it was fairly miserable to hack around the limitations/bugs in the DBMS.
>
> 5) I really think you undervalue the benefits of replication and reliability, especially in terms of cost. As someone who helps run a small site (about 300 machines) that ranges from commodity workers to Sun Thumpers, I can say that if your site depends on all your storage nodes functioning, then your costs go way up. You can't make cheap hardware scale unless your software can account for it.
>    - Yes, I realize this is a different approach from the one you take. There are pros and cons to large expensive hardware versus lots of cheap hardware ... the argument has been going on since the dawn of time. However, it's a bit unfair to just outright dismiss one approach. I am a bit wary of the claims that your results can scale up to Google/Yahoo scale, but I do agree that there are truly few users that are that large!
>
> I love your last paragraph; it's a very good conclusion. It kind of reminds me of the grid computing field, which was (is?) completely shocked by the emergence of cloud computing. After you cut through the hype surrounding the new fads, you find (a) that there are some very good reasons the fads are popular -- they have definite strengths that the existing field was missing (or didn't want to hear) -- and (b) that there's a lot of common ground and learning that has to be done, even to get a good common terminology :)
>
> Enjoy your conference!
>
> Brian
>
> On Apr 14, 2009, at 9:16 AM, Guilherme Germoglio wrote:
>
>> (Hadoop is used in the benchmarks)
>>
>> http://database.cs.brown.edu/sigmod09/
>>
>> There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
>>
>> --
>> Guilherme
>> msn: [email protected]
>> homepage: http://germoglio.googlepages.com

--
Guilherme
msn: [email protected]
homepage: http://germoglio.googlepages.com
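
For anyone curious about the UDF aggregation task Brian mentions in point 4, here is a minimal sketch of how the inlink count could be computed MapReduce-style. The input format ("source_url<TAB>target_url" per line), the local sort standing in for Hadoop's shuffle, and all names below are illustrative assumptions, not details taken from the paper.

#!/usr/bin/env python
# Minimal, single-process sketch of the inlink-count aggregation in MapReduce
# style. The tab-separated link format and the in-memory sort that stands in
# for Hadoop's shuffle phase are assumptions made for illustration.
import sys
from itertools import groupby


def mapper(lines):
    """Emit one (target_url, 1) pair for every link record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 2:
            _source, target = fields
            yield target, 1


def reducer(pairs):
    """Sum the counts per target URL; pairs must arrive grouped by key,
    which the MapReduce shuffle normally guarantees."""
    for target, group in groupby(pairs, key=lambda kv: kv[0]):
        yield target, sum(count for _, count in group)


if __name__ == "__main__":
    # Simulate map -> shuffle (sort) -> reduce over stdin; with Hadoop
    # Streaming the mapper and reducer would run as separate scripts
    # distributed across the cluster.
    for url, inlinks in reducer(sorted(mapper(sys.stdin))):
        print("%s\t%d" % (url, inlinks))

For example, piping two records "a.com<TAB>c.com" and "b.com<TAB>c.com" into the script would print "c.com<TAB>2". The same grouping and summing is what the paper's parallel DBMSs express with a UDF plus an aggregate query, which is why this task maps so naturally onto MR.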
