On Jan 22, 2007, at 3:55 AM, Andrzej Bialecki wrote:
> Why is that? Is it because people don't know about it, or other
> models for tackling out-of-core tasks are more popular, or it's not
> applicable to most problems out there? I'm not sure. From my
> experience I know that it's often not obvious (and sometimes
> impossible?) how to decompose an existing algorithm so that it fits
> in the map-reduce paradigm.
I would guess the primary reason is that until Hadoop came along
there wasn't a way to write or run map/reduce applications.
Map/reduce is still pretty new. MPI was first released 12 years ago.
PVM is even older. A lot of the people in the supercomputing
community are just learning about map/reduce.
> Do people use Hadoop for tasks outside the well-known class of
> web-related problems?
One obvious application that uses map/reduce is log analysis. If you
have gigabytes or terabytes of logs to scan and analyze, map/reduce
helps a lot.
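As a (very simplified) sketch of what I mean: counting log lines by
severity is just a mapper that emits (severity, 1) and a reducer that
sums. Something like the following, where the log format and the
exact Hadoop interfaces are illustrative rather than taken from a
real job:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class LogSeverityCount {

    // map: one log line in, (severity, 1) out; assumes the first
    // whitespace-delimited token of each line is a severity level
    public static class SeverityMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text severity = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] fields = line.toString().trim().split("\\s+", 2);
        if (fields.length > 0 && fields[0].length() > 0) {
          severity.set(fields[0]);
          context.write(severity, ONE);
        }
      }
    }

    // reduce: sum the ones emitted for each severity
    public static class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text severity, Iterable<IntWritable> counts,
                            Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        context.write(severity, new IntWritable(sum));
      }
    }
  }

The same shape covers most scan-and-aggregate questions over logs;
only the key you emit changes.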
In terms of non-obvious applications, for one of the Yahoo Hack Days,
I wrote a distributed dancing links
(http://en.wikipedia.org/wiki/Dancing_Links) solver and a pentomino
(http://en.wikipedia.org/wiki/Pentomino) tiling framework based on it.
In particular, I solved the 9x10 grid using the 18 one-sided
pentominoes.
The approach I used expanded the search tree to depth N and wrote out
each possible prefix as a line. At a depth of 5, my problem generated
150,000 prefixes. I used map/reduce with the text input format to
split the input space into a bunch of pieces and generate all of the
solutions for each prefix, and a single reduce to collect the
solutions (there is a sketch of the map side after the example output
below).
So the input file looked like:
1,0,0,1,0,
1,0,0,1,1,
1,0,0,1,2,
1,0,0,1,3,
1,0,0,1,4,
1,0,0,1,5,
..
and the output looked like (each letter is a piece and the grid shows
the layout):
1,0,0,1,1,
uuuiiiiit
uxupppttt
xxxwppPPt
vxwwffPPn
vwwffYlPn
vvvzfYlnn
yzzzYYlnL
yzZZFYllL
yyZFFFNNL
yZZFNNNLL
1,0,0,1,1,
uuuiiiiit
uxupppttt
xxxwppFNt
vxwwFFFNN
vwwllFnYN
vvvzlnnYN
yzzzlnYYL
yzZZlnfYL
yyZPPPffL
yZZPPffLL
...
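Schematically, the map side just takes one prefix per input line and
finishes the search below it. A rough sketch of that mapper, with a
stand-in Solver interface in place of the actual dancing links code:

  import java.io.IOException;
  import java.util.List;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class PrefixMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Stand-in for the dancing links search: given a prefix of piece
    // placements, return every complete tiling below it.
    public interface Solver {
      List<String> solveFrom(String prefix);
    }

    private Solver solver;  // would be created in setup() in a real job

    @Override
    protected void map(LongWritable offset, Text prefixLine, Context context)
        throws IOException, InterruptedException {
      // each input line is one search prefix, e.g. "1,0,0,1,4,"
      String prefix = prefixLine.toString().trim();
      if (prefix.length() == 0) {
        return;
      }
      // finish the search below this prefix and emit every tiling found
      for (String solution : solver.solveFrom(prefix)) {
        context.write(new Text(prefix), new Text(solution));
      }
    }
  }

The reduce side doesn't need any logic at all; it only exists to
gather the solutions from all the maps into one place.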
I chose that particular problem because in Knuth's paper on dancing
links (http://xxx.lanl.gov/PS_cache/cs/pdf/0011/0011047.pdf), on page
16, he started to solve it and determined it would take a couple of
months using his solver. Using Hadoop on 20 nodes, it takes a couple
of hours. Fun stuff. I should probably package up the solver and
submit it to the Hadoop examples. As a side benefit, I also did a
sudoku solver using my dancing links solver. There isn't any point in
running the sudoku solver distributed, though, since it can solve
huge puzzles (as in 42x42) in less than a second.
-- Owen