Thanks for sharing this - I find these comparisons really interesting.
I have a small comment after a quick skim.
[Apologies for commenting on something so low level, but personal
experience has shown it really does influence performance.]

One thing not touched on in the article is that developers need to
keep performance in mind when writing MapReduce programs, much as
they do when checking that the DB query optimizer has chosen a
sensible join order.

For example, in your MR code there are two simple places to improve performance:

  String fields[] = value.toString()
      .split("\\" + BenchmarkBase.VALUE_DELIMITER);

This creates a new String for "\\" + BenchmarkBase.VALUE_DELIMITER,
compiles a Pattern from it, splits the input, and then both the new
String and the Pattern immediately become garbage to collect.

String splitting like this is far quicker with a precompiled Pattern
that you reuse:

  static final Pattern SPLITTER =
      Pattern.compile("\\" + BenchmarkBase.VALUE_DELIMITER);
  ...
  String fields[] = SPLITTER.split(value.toString());

A simple loop splitting 100,000 records went from 431 msec to 69 msec
on my 2G MacBook Pro. Now consider what happens when splitting
billions of rows (and it only gets worse with a bigger input string).
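
For what it's worth, a minimal sketch of the kind of loop I timed -
the "|" delimiter and the record contents are made up for
illustration:

  import java.util.regex.Pattern;

  public class SplitBench {
      // Stand-in for Pattern.compile("\\" + BenchmarkBase.VALUE_DELIMITER)
      static final Pattern SPLITTER = Pattern.compile("\\|");

      public static void main(String[] args) {
          String line = "a|b|c|d|e|f|g|h";
          long sink = 0; // keeps the JIT from discarding the work

          // Naive: String.split() builds a fresh Pattern on each call.
          long t0 = System.currentTimeMillis();
          for (int i = 0; i < 100000; i++) {
              sink += line.split("\\|").length;
          }
          long naive = System.currentTimeMillis() - t0;

          // Reused: one precompiled Pattern for all records.
          t0 = System.currentTimeMillis();
          for (int i = 0; i < 100000; i++) {
              sink += SPLITTER.split(line).length;
          }
          long reused = System.currentTimeMillis() - t0;

          System.out.println("naive " + naive + " msec, reused "
              + reused + " msec (" + sink + ")");
      }
  }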

The other gain is reusing objects rather than creating new ones:

  key = new Text(key.toString().substring(0, 7));

Unnecessary object creation, and the garbage collection it triggers,
kills Java performance in any application.
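
Hadoop's Writable types are mutable precisely so they can be reused
across records. A sketch of the pattern using the old mapred API -
the class and field names here are my own invention, not from your
code:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class PrefixMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {

      // One Text instance reused for every record instead of a
      // fresh allocation per map() call.
      private final Text outKey = new Text();

      public void map(LongWritable offset, Text value,
                      OutputCollector<Text, Text> output,
                      Reporter reporter) throws IOException {
          // set() overwrites the internal buffer in place.
          outKey.set(value.toString().substring(0, 7));
          output.collect(outKey, value);
      }
  }

This is safe because the collector serializes the key at collect()
time, so the same buffer can be overwritten on the next call.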

(I haven't seen it in your code, but another common cost is using
Exceptions for control flow where a simple if/else check is far
quicker.)
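
For instance, a cheap character check before Integer.parseInt() beats
catching a NumberFormatException on every bad record by a wide
margin. A contrived sketch - the field variable and the -1 fallback
are made up, and signs/overflow are ignored for brevity:

  // Guard with a simple test instead of using the exception for
  // flow control; filling in a stack trace per record is costly.
  static boolean isDigits(String s) {
      if (s.length() == 0) return false;
      for (int i = 0; i < s.length(); i++) {
          if (!Character.isDigit(s.charAt(i))) return false;
      }
      return true;
  }

  // Fast path: check first, then parse.
  int v = isDigits(field) ? Integer.parseInt(field) : -1;

  // Slow path: the same logic via exception handling.
  int w;
  try {
      w = Integer.parseInt(field);
  } catch (NumberFormatException e) {
      w = -1;
  }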

These are really trivial things that people often overlook, but when
you run these operations billions of times it really adds up - much
like declaring an indexed DB column as BIGINT where a SMALLINT would
do.

Again - I apologise for commenting on such a trivial thing (I feel a
bit silly explaining how to split a String efficiently in Java on
this list), but it might be worth considering when doing these kinds
of tests - and as you say, RDBMSs have 20 years of these performance
tweaks behind them. Of course, the fact that RDBMSs mostly shield
people from such low-level details is a huge benefit in itself, and
might be worth mentioning.

Cheers,

Tim

On Tue, Apr 14, 2009 at 4:16 PM, Guilherme Germoglio
<[email protected]> wrote:
> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>
> There is currently considerable enthusiasm around the MapReduce
> (MR) paradigm for large-scale data analysis [17]. Although the
> basic control flow of this framework has existed in parallel SQL
> database management systems (DBMS) for over 20 years, some
> have called MR a dramatically new computing model [8, 17]. In
> this paper, we describe and compare both paradigms. Furthermore,
> we evaluate both kinds of systems in terms of performance and
> development complexity. To this end, we define a benchmark
> consisting of a collection of tasks that we have run on an open
> source version of MR as well as on two parallel DBMSs. For each
> task, we measure each system's performance for various degrees of
> parallelism on a cluster of 100 nodes. Our results reveal some
> interesting trade-offs. Although the process to load data into and
> tune the execution of parallel DBMSs took much longer than the MR
> system, the observed performance of these DBMSs was strikingly
> better. We speculate about the causes of the dramatic performance
> difference and consider implementation concepts that future
> systems should take from both kinds of architectures.
>
>
> --
> Guilherme
>
> msn: [email protected]
> homepage: http://germoglio.googlepages.com
>
