Marcus Herou wrote:
Hi.

This is my company so I reveal what I like, even though the board would
shoot me but hey do you think they are scanning this mailinglist ? :)

The PR algo is very simple (but clever) and can be found on wikipedia:
http://en.wikipedia.org/wiki/PageRank
What is painful is to calculate it in a distributed architecture. You will
never achieve your goal by using a DB to store the score/node and links
from/to it (we did not at least).
We use plain lucene indexes and 10 memcached servers to store the
intermediate scoring and run enough iterations for the scoring to almost
converge (it never converges completely).


memcached? why not store the intermediate values in the MR FS?

Here's Paolo Castagna's MR-based implementation (some logging omitted)

 - first clean the data up
 -count the files and give them all an initial score
 -update the ranks, dealing with dangling pointers
 -repeat until you are happy.
 -sort and display the output

No DB, just HDFS

        exec("Data cleanup", new CheckingData(), inpath, output);
        //count the data
        String countFile = outpath + File.separator + COUNT;
        exec("Page count", new CountPages(), output, countFile);
        String count = read(fs, countFile);
exec("InitializeRanks", new InitializeRanks(), output, input, count);
        int iterations = 0;
        String sortedRanksDir = outpath + File.separator +
                SORTED_RANKS;
        while (iterations < iterationLimit) {
            String danglingFile = outpath + File.separator + DANGLING;
exec("DanglingPages", new DanglingPages(), input, danglingFile);
            String dangling = read(fs, danglingFile);
exec("UpdateRanks", new UpdateRanks(), input, output, count, dangling);
            overwrite(fs, new Path(output), new Path(input));
            if ((iterations > CHECK_CONVERGENCE_FREQUENCY)
                    && (iterations % CHECK_CONVERGENCE_FREQUENCY == 0)) {
String convergenceFile = outpath + File.separator + CONVERGENCE;
                exec("CheckConvergence", new CheckConvergence(),
                        input,
                        convergenceFile);
double tolerance = Double.parseDouble(read(fs, convergenceFile));

                if (tolerance <= toleranceArg) {
                    //exit the loop when we are happy
                    break;
                }
            }

            iterations++;
        }

        exec("SortRanks", new SortRanks(), input, sortedRanksDir);




Reply via email to