Hi Nilesh,

We've had an implementation of PageRank, which was removed a few releases ago. I don't think PageRank would be a valuable contribution, because I don't see a MapReduce based implementation being able to compete with systems such as Apache Giraph that also run on a standard Hadoop cluster.

If you wanted to work on a block-based version of our matrix multiplication code (that is used in some algorithms such as our existing Lanczos implementation afaik) that would be a very valuable contribution, however.

@list Any other opinions on that?

Best,
Sebastian

On 01/28/2014 06:07 PM, Nilesh Chakraborty (JIRA) wrote:

     [ 
https://issues.apache.org/jira/browse/MAHOUT-742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884332#comment-13884332
 ]

Nilesh Chakraborty commented on MAHOUT-742:
-------------------------------------------

Hey [~ssc], I've been working on an implementation of large-scale blockwise 
matrix-vector multiplication using a single MapReduce job. The current 
algorithms and implementations need two MapReduce jobs for blockwise 
multiplication (or any sparse mat-vec mult where the vector isn't stored in 
memory). I will be using it to implement PageRank. I'll benchmark my 
implementation against the state-of-the-art in MapReduce-based PageRank - 
Pegasus (they've contributed the Pegasus code to 
https://github.com/intel-hadoop/HiBench).

If my version turns out to be faster, I'll be writing the code for algorithms 
like SVD and Lanczos algorithm (http://en.wikipedia.org/wiki/Lanczos_algorithm) 
too.

Do you think these can make for a useful contribution to Mahout? I need to keep 
that in mind before I go forward with coding.

Pagerank implementation in Map/Reduce
-------------------------------------

                 Key: MAHOUT-742
                 URL: https://issues.apache.org/jira/browse/MAHOUT-742
             Project: Mahout
          Issue Type: New Feature
          Components: Graph
    Affects Versions: 0.6
            Reporter: Christoph Nagel
            Assignee: Sebastian Schelter
             Fix For: 0.6

         Attachments: MAHOUT-742.patch


Hi,
my name is Christoph Nagel. I'm student on technical university Berlin and 
participating on the course of Isabel Drost and Sebastian Schelter.
My work is to implement the pagerank-algorithm, where the pagerank-vector fits 
in memory.
For the computation I used the naive algorithm shown in the book 'Mining of Massive 
Datasets' from Rajaraman & Ullman 
(http://www-scf.usc.edu/~csci572/2012Spring/UllmanMiningMassiveDataSets.pdf).
Matrix- and vector-multiplication are done with mahout methods.
Most work is the transformation the input graph, which has to consists of a 
nodes- and edges file.
Format of nodes file: <node>\n
Format of edges file: <startNode>\t<endNode>\n
Therefore I created the following classes:
* LineIndexer: assigns each line an index
* EdgesToIndex: indexes the nodes of the edges
* EdgesIndexToTransitionMatrix: creates the transition matrix
* Pagerank: computes PR from transition matrix
* JoinNodesWithPagerank: creates the joined output
* PagerankExampleJob: does the complete job
Each class has a test (not PagerankExampleJob) and I took the example of the 
book for evaluating.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Reply via email to