On Wed, Jun 9, 2010 at 4:57 PM, Neal Clark <[email protected]> wrote:

> I am currently looking for a good way to generate large graphs or get
> access
> to large graph datasets. Since the number of vertices usually far exceeds
> the number of reduce nodes, overloading a single vertex shouldn't be too
> much of a problem. More testing is needed to confirm this. If overloading
> does prove to be a problem, we can use a two-phase approach to determine
> the min vertices.
>
> Open questions:
> 1) What input formats should be supported?
>

We don't have a good answer for that yet.  Recent discussions have talked a
lot about how to vectorize more ordinary text.


> 2) Do you have any suggestions on what intermediary format could be used
> between phases?
>

These should be sequence files of some kind.  Using the Mahout vector format
would probably work well at the cost of a bit of overhead due to using
doubles to store integers.
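To make that overhead concrete, here is a toy sketch (plain Python, not Mahout API; names are illustrative) of encoding adjacency lists the way a sparse Mahout vector would hold them, with each integer neighbor id keyed to a double:

```python
# Toy sketch: adjacency lists stored as sparse {neighbor_index: double}
# maps, mimicking how a sparse Mahout vector would hold them. Not
# Mahout code -- just illustrating the doubles-for-integers overhead.

def adjacency_to_sparse_vectors(edges, num_vertices):
    """Build one sparse 'vector' (dict of int -> float) per vertex.

    Each neighbor id (an integer) is recorded as the double 1.0 at
    that index, which is the small overhead mentioned above: 8-byte
    doubles where 4-byte ints would do.
    """
    vectors = {v: {} for v in range(num_vertices)}
    for src, dst in edges:
        vectors[src][dst] = 1.0  # edge weight stored as a double
    return vectors

# A tiny 3-cycle: 0 -> 1 -> 2 -> 0.
edges = [(0, 1), (1, 2), (2, 0)]
vecs = adjacency_to_sparse_vectors(edges, 3)
```

Serialized one vector per key in a sequence file, that layout gives each reducer a vertex id plus its neighborhood in a single record.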


> 3) How best to approach integrating these algorithms into Mahout?
>

You are breaking new ground here with graph algorithms in Mahout.


> 4) Does anyone know where I can find some large test graphs?
>

Consider the Wikipedia link graph.  Also interesting might be the
cooccurrence graph of words in a large corpus.  The Twitter social graph
might be interesting as well.
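On the cooccurrence idea: a graph can be derived from any plain-text corpus. A rough sketch (window-based cooccurrence counts; my own toy code, not anything in Mahout):

```python
# Toy sketch: build undirected word-cooccurrence edges from a token
# stream using a sliding window. Edge weights are raw pair counts.
from collections import Counter

def cooccurrence_edges(tokens, window=2):
    """Count cooccurrences of each word pair within `window` tokens."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:
            if w != other:
                # Sort so (a, b) and (b, a) land on the same edge.
                counts[tuple(sorted((w, other)))] += 1
    return counts

tokens = "the cat sat on the mat".split()
edges = cooccurrence_edges(tokens, window=2)
```

Run over something the size of Wikipedia text, this produces a large, naturally skewed graph, which is exactly the degree distribution worth testing against.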


> 5) Do you think that this type of algorithm is a good fit for Mahout?
>

I do, even though we haven't had much pull for graph algorithms yet.  This
could easily change.
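Regarding the two-phase min-vertex approach mentioned at the top of the thread: the core step is min-label propagation, iterating until every vertex carries the smallest vertex id in its connected component. A non-MapReduce sketch of that iteration (my own toy code, assuming undirected edges):

```python
# Toy sketch of min-vertex label propagation for connected components.
# Each pass plays the role of one map/reduce iteration: every vertex
# offers its current label across its edges, and each endpoint keeps
# the minimum it has seen. Repeat until no label changes.

def min_vertex_components(num_vertices, edges):
    """Label each vertex with the minimum vertex id in its component."""
    labels = list(range(num_vertices))  # each vertex starts as its own min
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(labels[u], labels[v])
            if labels[u] != low or labels[v] != low:
                labels[u] = labels[v] = low
                changed = True
    return labels

# Two components: {0, 1, 2} and {3, 4}.
edges = [(0, 1), (1, 2), (3, 4)]
labels = min_vertex_components(5, edges)
```

In the MapReduce version, each pass is a full job with the candidate labels as the intermediate output, so the number of passes (bounded by the graph diameter) is what makes the second phase worth measuring.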
