Marcus Herou wrote:
Hi.

Comments inline

Cheers
//Marcus
On Fri, Jul 3, 2009 at 4:48 PM, Steve Loughran <[email protected]> wrote:

Marcus Herou wrote:

Hi.

This is my company so I reveal what I like, even though the board would
shoot me, but hey, do you think they are scanning this mailing list? :)

The PR algo is very simple (but clever) and can be found on Wikipedia:
http://en.wikipedia.org/wiki/PageRank
What is painful is calculating it in a distributed architecture. You will
never achieve your goal by using a DB to store the score per node and the
links from/to it (we did not, at least).
We use plain Lucene indexes and 10 memcached servers to store the
intermediate scores, and run enough iterations for the scoring to almost
converge (it never converges completely).
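
To make "run enough iterations for the scoring to almost converge" concrete, here is a minimal single-machine sketch of the textbook PageRank iteration. It is not the Lucene/memcached setup described above: the scores live in plain in-memory dicts, and the function name, the damping factor of 0.85, the tolerance and the iteration cap are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """Iterate PageRank until the scores almost stop changing.

    links: dict mapping each node to the list of nodes it links to.
    damping, tol and max_iter are illustrative defaults, not values
    taken from the thread.
    Returns (scores, iterations_used).
    """
    # Collect every node, including ones that only appear as link targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution.
    scores = {node: 1.0 / n for node in nodes}
    iterations = 0

    for iterations in range(1, max_iter + 1):
        # Score held by nodes without outgoing links is spread over all nodes.
        dangling = sum(scores[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_scores = {node: base for node in nodes}

        # Each node passes an equal share of its score along every outgoing link.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share

        # Total absolute change; stop once it is below the tolerance.
        delta = sum(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if delta < tol:
            break

    return scores, iterations
```

The stopping rule is the point here: the change per iteration shrinks but never reaches exactly zero, so the loop stops at a tolerance instead of waiting for full convergence. In a distributed run the two dicts are the intermediate state that has to be read and written on every iteration, which is where the choice of store matters.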


memcached? why not store the intermediate values in the MR FS?

Why do you think I chose memcached? It was not due to the nice API... Doh!
Performance, of course. It beats the hell out of HDFS for small, temporarily
stored data (that's why it's called a cache and not an FS). By using memcached
we again end up using something which is not shared-nothing, but which is
nicely distributed and has great performance for the workload. HBase
cannot compete in the <1ms range either.

OK. I think memcached is best in the front end for that latency stuff, so it was unusual to find it in the back as well. But at least it eliminates another SPOF, and its licensing costs are minimal :)

-steve
