Answering some of your email out of order,

On Mon, Mar 23, 2009 at 10:00 PM, Xuan Yang <sailingw...@gmail.com> wrote:

> These days I am doing some research work on SimRank, which is a model
> measuring the "similarity" of objects.


Great.



> I think it would be great to solve these problems and implement a
> MapReduce version of the algorithm for SimRank.
> I intend to implement this as my Summer of Code project. Would you be
> interested in this?


This sounds like a fine project.


> And can I get some advice from you?


I am sure you can get lots of advice from this group, both on the algorithm
and on how to code it into a program.

Back to your detailed suggestion.  Here are some of my first thoughts:


> 1, the directed graph could be saved in the form of an edge list in HBase.
> And the result Sn(a,b) could also be saved in HBase as a matrix.


HBase or flat files would be a fine place to keep this, and an edge list is
an excellent way to store the data.

The output matrix should probably be stored as triples containing row,
column and value.
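
For example, a flat file version of that could just be tab-separated lines
like these (node names and values made up for illustration), with zero
entries simply left out so that the file stays sparse:

    nodeA   nodeB   0.42
    nodeA   nodeC   0.17
    nodeB   nodeC   0.08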


> 2, We can distribute all the n^2 pairs into the map nodes to calculate
> SimRank value of the next iteration.


Hopefully you can keep this sparse.  If you cannot, then the algorithm may
not be suitable for use on large data no matter how you parallelize it.

Skipping item 3 because I don't have time right now to analyze it in
detail...


> 4, besides, there are other optimization methods, such as thresholding, that
> could be used in Map nodes and Reduce nodes.


Thresholding is likely to be a critical step in order to preserve sparsity.
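
A minimal sketch of what that could look like at the point where values are
written out (the cutoff here is a tuning knob I made up, not something that
SimRank prescribes):

    // Sketch only: drop small similarity values so the matrix stays sparse.
    public class ThresholdSketch {
      static final double CUTOFF = 0.001;   // illustrative value; needs tuning

      // Returns the triple to store, or null if the entry should be pruned.
      static String emit(String a, String b, double similarity) {
        if (similarity < CUTOFF) {
          return null;
        }
        return a + "\t" + b + "\t" + similarity;
      }
    }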


> 1, It is true that mapreduce could make the computation of each node
> easier. Yet if the volume of data is very huge, the transport latency of
> data will become more and more serious.


I think that you will find, with map-reduce in general and with Hadoop more
specifically, that as the problem gets larger, the discipline imposed by the
map-reduce formulation on your data transport patterns actually allows
better scaling than you would expect.  Of course, if your data size scales
with n^2, you are in trouble no matter how you parallelize.

A good example came a year or so ago with a machine translation group at a
university in Maryland.  They had a large program that attempted to do
cooccurrence counting on text corpora using a single multi-core machine.
They started to convert this to Hadoop using the simplest possible
representation for the cooccurrence matrix (index, value triples) and
expected that the redundancy of this representation would lead to very bad
results.  Since they expected bad results, they also expected to do lots of
optimization on the map-reduce version.  Also, since the original program
was largely memory based, they expected that the communication overhead of
Hadoop would severely hurt performance.

The actual result was that a program that had taken 18 hours ran in 20
minutes on 70 machines.  That is nearly perfect speedup over the sequential
version.  The moral is that highly sequential transport of large blocks of
information can be incredibly efficient.

> So, methods to reduce IO would be very helpful.


My first recommendation on this is to wait.  Get an implementation first,
then optimize.  The problems you have will not be the problems you expect.


> 2, SimRank is to compute the similarity between all the nodes. If we map a
> group of nodes {A, B, C} into one map node, and {D, E, F} into another map
> node. The computation inside set {A, B, C} will be easy, so will be set {D,
> E, F}. But when we want to compute SimRank between A and D, It will not be
> very convenient.


Map nodes should never communicate to each other.  That is the purpose of
the reduce layer.

I think that what you should do is organize your recursive step so that the
sum happens in the reduce.  Then each mapper would output records where the
key is the index pair for the summation (a and b in the notation used on
Wikipedia) and the reduce does this summation.  This implies that you change
your input format slightly to be variable-length records containing a node
index and the In set for that node.  This transformation is a very simple,
one-time map-reduce step.
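
Just to show the shape of that summation reduce, here is a rough sketch
(using the org.apache.hadoop.mapreduce API; the choice of a Text key of the
form "a,b" and the class name are mine, not anything that already exists):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch only: sums the contributions for one (a, b) pair.  The mappers are
    // assumed to emit keys of the form "a,b" with the partial values to be summed.
    // The C / (|In(a)| * |In(b)|) factor from the SimRank definition still has to
    // be applied, either here (if the in-degree counts are carried along in the
    // values) or in a small follow-up pass.
    public class PairSumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
      @Override
      protected void reduce(Text pair, Iterable<DoubleWritable> contributions,
                            Context context) throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable c : contributions) {
          sum += c.get();
        }
        context.write(pair, new DoubleWritable(sum));
      }
    }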

More specifically, you would have original input which initially has zero
values for R:

   links: (Node from, Node to, double R)

and a transform MR step that does this to produce an auxiliary file
inputSets: (Node to, List<Node> inputs):

    map: (Node from, Node to) -> (to, from)
    reduce: (Node to, List<Node> inputs) -> to, inputs
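
In Hadoop terms that might look roughly like the following (a sketch only,
using the org.apache.hadoop.mapreduce API and assuming the links file is
tab-separated text; all of the class names are invented):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch of the transform step.  Input lines are assumed to be
    // tab-separated "from to R" triples.
    public class BuildInputSets {
      // map: (Node from, Node to) -> (to, from)
      public static class InvertEdgeMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split("\t");
          context.write(new Text(fields[1]), new Text(fields[0]));
        }
      }

      // reduce: (Node to, List<Node> inputs) -> to, comma-separated In set
      public static class CollectInputsReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text to, Iterable<Text> inputs, Context context)
            throws IOException, InterruptedException {
          StringBuilder inSet = new StringBuilder();
          for (Text from : inputs) {
            if (inSet.length() > 0) {
              inSet.append(',');
            }
            inSet.append(from.toString());
          }
          context.write(to, new Text(inSet.toString()));
        }
      }
    }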

Now you need to join the original input to the auxiliary file on both the
from and to indexes.  This join would require two map-reduces, one to join
on the from index and one to join on the to index.  The reduce in the final
step should emit the cross product of the input sets.  Then you need to join
that against the original data.  That join would require a single map-reduce
for the join.  Finally, you need to group on the to index and sum up all of
the distances and output a data set for the next round of computation.

This is pretty complicated, but not all that hard to do.
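
For the joins themselves, the standard trick is a reduce-side join: each of
the two mappers tags its records with the file it came from, and the reducer
pairs them up.  A bare-bones sketch of the reducer for the join on the from
index (the "L:"/"S:" tags and the record layout are my own invention):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch of a reduce-side join keyed on the from index.  The two mappers
    // (not shown) are assumed to prefix their values with "L:" for link records
    // and "S:" for inputSets records so the reducer can tell them apart.
    public class JoinOnFromReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text from, Iterable<Text> tagged, Context context)
          throws IOException, InterruptedException {
        List<String> links = new ArrayList<String>();
        List<String> inSets = new ArrayList<String>();
        for (Text value : tagged) {
          String v = value.toString();
          if (v.startsWith("L:")) {
            links.add(v.substring(2));
          } else {
            inSets.add(v.substring(2));
          }
        }
        // Emit every combination.  This is exactly where fill-in will hurt:
        // if the matrix stops being sparse, these loops explode.
        for (String link : links) {
          for (String inSet : inSets) {
            context.write(from, new Text(link + "\t" + inSet));
          }
        }
      }
    }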

The key, as I mentioned before, will be to avoid catastrophic fill-in of
your distance matrix.  If it becomes non-sparse, then the Cartesian products
in the joins above will be absolutely catastrophic because you will have far
too much data flying around.  This problem is not specific to map-reduce; it
is a problem with SimRank itself.

I would also recommend that you look at some alternative distance measures.
One simple option is just cooccurrence filtered by a simple test.  This
requires fewer map-reduce steps, does not require multiple iterations, and
cannot produce fill-in.  I have used iterative algorithms similar to
SimRank, but my personal experience is that this simpler algorithm produces
recommendations on par with much more complex algorithms, and that
subsequent presentation-level changes have more potential for improvement
than algorithmic changes.
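
For what it's worth, cooccurrence counting fits map-reduce very naturally.
A sketch, assuming each input line lists the members of one In set (or, in a
recommendation setting, one user's items), with the filtering test shown as
a plain count cutoff (both the input layout and the cutoff are my own
assumptions, not a recommendation):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Cooccurrence {
      // Each input line: tab-separated items that occurred together.
      public static class PairMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String[] items = line.toString().split("\t");
          for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
              context.write(new Text(items[i] + "," + items[j]), ONE);
            }
          }
        }
      }

      // Count each pair and keep it only if it passes a simple test.
      public static class CountReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        private static final long MIN_COUNT = 5;   // illustrative cutoff only

        @Override
        protected void reduce(Text pair, Iterable<LongWritable> counts,
                              Context context) throws IOException, InterruptedException {
          long total = 0;
          for (LongWritable c : counts) {
            total += c.get();
          }
          if (total >= MIN_COUNT) {
            context.write(pair, new LongWritable(total));
          }
        }
      }
    }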

I hope that this helps.

-- 
Ted Dunning, CTO
DeepDyve
