Hi. Comments inline
Cheers //Marcus On Fri, Jul 3, 2009 at 4:48 PM, Steve Loughran <[email protected]> wrote: > Marcus Herou wrote: > >> Hi. >> >> This is my company so I reveal what I like, even though the board would >> shoot me but hey do you think they are scanning this mailinglist ? :) >> >> The PR algo is very simple (but clever) and can be found on wikipedia: >> http://en.wikipedia.org/wiki/PageRank >> What is painful is to calculate it in a distributed architecture. You will >> never achieve your goal by using a DB to store the score/node and links >> from/to it (we did not at least). >> We use plain lucene indexes and 10 memcached servers to store the >> intermediate scoring and run enough iterations for the scoring to almost >> converge (it never converges completely). >> >> > memcached? why not store the intermediate values in the MR FS? Why do you think I chose memcached ? It was not due to the nice API... Doh! Performance of course. It beats the hell out of HDFS for small temporarily stored data (that's why it's called a cache and not FS). By using memcached we again have the matter of using something which is not shared-nothing but nicely distributed and have a great performance for the workload. HBase cannot compete either in the <1ms range > > > Here's Paolo Castagna's MR-based implementation (some logging omitted) > > - first clean the data up > -count the files and give them all an initial score > -update the ranks, dealing with dangling pointers > -repeat until you are happy. > -sort and display the output > > No DB, just HDFS > > exec("Data cleanup", new CheckingData(), inpath, output); > //count the data > String countFile = outpath + File.separator + COUNT; > exec("Page count", new CountPages(), output, countFile); > String count = read(fs, countFile); > exec("InitializeRanks", new InitializeRanks(), output, input, > count); > int iterations = 0; > String sortedRanksDir = outpath + File.separator + > SORTED_RANKS; > while (iterations < iterationLimit) { > String danglingFile = outpath + File.separator + DANGLING; > exec("DanglingPages", new DanglingPages(), input, danglingFile); > String dangling = read(fs, danglingFile); > exec("UpdateRanks", new UpdateRanks(), input, output, count, > dangling); > overwrite(fs, new Path(output), new Path(input)); > if ((iterations > CHECK_CONVERGENCE_FREQUENCY) > && (iterations % CHECK_CONVERGENCE_FREQUENCY == 0)) { > String convergenceFile = outpath + File.separator + > CONVERGENCE; > exec("CheckConvergence", new CheckConvergence(), > input, > convergenceFile); > double tolerance = Double.parseDouble(read(fs, > convergenceFile)); > > if (tolerance <= toleranceArg) { > //exit the loop when we are happy > break; > } > } > > iterations++; > } > > exec("SortRanks", new SortRanks(), input, sortedRanksDir); > > > > > -- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 [email protected] http://www.tailsweep.com/
