Hi Brian,

I'm sorry but it is not my paper. :-) I've posted the link here because
we're always looking for comparison data -- so, I thought this benchmark
would be welcome.

Also, I won't attend the conference. However, it would be a good idea for
someone who will attend to put these questions and comments to the authors
directly and then post their answers here.


On Tue, Apr 14, 2009 at 2:26 PM, Brian Bockelman <[email protected]> wrote:

> Hey Guilherme,
>
> It's good to see comparisons, especially as it helps folks understand
> better what tool is the best for their problem.  As you show in your paper,
> a MapReduce system is hideously bad at performing tasks that column-store
> databases were designed for (selecting a single value along an index,
> joining tables).
>
> Some comments:
> 1) For some of your graphs, you show Hadoop's numbers in half-grey,
> half-white.  I can't figure out for the life of me what this signifies!
>  What have I overlooked?
> 2) I see that one of your co-authors is the CEO/inventor of the Vertica DB.
>  Out of curiosity, how did you interact with Vertica versus Hadoop versus
> DBMS-X?  Did you get help tuning the systems from the experts?  I.e., if you
> sat down with a Hadoop expert for a few days, I'm certain you could squeeze
> out more performance, just like whenever I sit down with an Oracle DBA for a
> few hours, my DB queries are much faster.  You touch upon the sociological
> issues (having to program your own code versus having to only know SQL, as
> well as the comparative time it took to set up the DB) - I'd like to hear
> how much time you spent "tweaking" and learning the best practices for the
> three, very different approaches.  If you added a 5th test, what's the
> marginal effort required?
> 3) It would be nice to see how some of your more DB-like tasks perform on
> something like HBase.  That'd be a much more apples-to-apples comparison of
> column-store DBMS versus column-store data system, although the HBase work
> is just now revving up.  I'm a bit uninformed in that area, so I don't have
> a good gut feeling for how that'd do.
> 4) I think that the UDF aggregation task (calculating the inlink count for
> each document in a sample) is interesting - it's a more Map-Reduce oriented
> task, and it sounds like it was fairly miserable to hack around the
> limitations / bugs in the DBMS.
> 5) I really think you undervalue the benefits of replication and
> reliability, especially in terms of cost.  As someone who helps with a small
> site (about 300 machines) that range from commodity workers to Sun Thumpers,
> if your site depends on all your storage nodes functioning, then your costs
> go way up.  You can't make cheap hardware scale unless your software can
> account for it.
>  - Yes, I realize this is a different approach than you take.  There are
> pros and cons to large expensive hardware versus lots of cheap hardware ...
> the argument has been going on since the dawn of time.  However, it's a bit
> unfair to just outright dismiss one approach.  I am a bit wary of the claims
> that your results can scale up to Google/Yahoo scale, but I do agree that
> there are truly few users that are that large!
>
> I love your last paragraph, it's a very good conclusion.  It kind of
> reminds me of the grid computing field which was (is?) completely shocked by
> the emergence of cloud computing.  After you cut through the hype
> surrounding the new fads, you find (a) that there are some very good reasons
> that the fads are popular - they have definite strengths that the existing
> field was missing (or didn't want to hear) and (b) there's a lot of common
> ground and learning that has to be done, even to get a good common
> terminology :)
>
> Enjoy your conference!
>
> Brian
>
> On Apr 14, 2009, at 9:16 AM, Guilherme Germoglio wrote:
>
>  (Hadoop is used in the benchmarks)
>>
>> http://database.cs.brown.edu/sigmod09/
>>
>> There is currently considerable enthusiasm around the MapReduce
>> (MR) paradigm for large-scale data analysis [17]. Although the
>> basic control flow of this framework has existed in parallel SQL
>> database management systems (DBMS) for over 20 years, some
>> have called MR a dramatically new computing model [8, 17]. In
>> this paper, we describe and compare both paradigms. Furthermore,
>> we evaluate both kinds of systems in terms of performance and
>> development complexity. To this end, we define a benchmark
>> consisting of a collection of tasks that we have run on an open source
>> version of MR as well as on two parallel DBMSs. For each task,
>> we measure each system’s performance for various degrees of
>> parallelism on a cluster of 100 nodes. Our results reveal some
>> interesting trade-offs. Although the process to load data into and tune
>> the execution of parallel DBMSs took much longer than the MR
>> system, the observed performance of these DBMSs was strikingly
>> better. We speculate about the causes of the dramatic performance
>> difference and consider implementation concepts that future
>> systems should take from both kinds of architectures.
>>
>>
>> --
>> Guilherme
>>
>> msn: [email protected]
>> homepage: http://germoglio.googlepages.com
>>
>
>


-- 
Guilherme

msn: [email protected]
homepage: http://germoglio.googlepages.com
