On Wed, Jun 9, 2010 at 4:57 PM, Neal Clark <[email protected]> wrote:
> I am currently looking for a good way to generate large graphs or get
> access to large graph datasets. Since the number of vertexes usually far
> exceeds the number of reduce nodes, overloading a single vertex shouldn't
> be too much of a problem. More testing is needed to confirm this. If
> overloading does prove to be a problem, we can use a two-phase approach
> to determine the min vertexes.
>
> Open questions:
>
> 1) What input formats should be supported?
We don't have a good answer for that yet. Recent discussions have talked a
lot about how to vectorize more ordinary text.

> 2) Do you have any suggestions on what intermediary format could be used
> between phases?

These should be sequence files of some kind. Using the Mahout vector format
would probably work well, at the cost of a bit of overhead due to using
doubles to store integers.

> 3) How best to approach integrating these algorithms into Mahout?

You are breaking new ground here with graph algorithms in Mahout.

> 4) Does anyone know where I can find some large test graphs?

Consider the Wikipedia link graph. Also interesting might be the
cooccurrence graph of words in a large corpus. The Twitter social graph
might be interesting as well.

> 5) Do you think that this type of algorithm is a good fit for Mahout?

I do, even though we haven't had much pull for graph algorithms yet. This
could easily change.
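One more note on the intermediary-format question: the "doubles to store
integers" overhead is size only, not correctness, since IEEE-754 doubles
represent every integer exactly up to 2**53. A quick illustration (Python
and `struct` are just a stand-in here for the serialized record; the real
format would be Hadoop sequence files of Mahout vectors):

```python
import struct

# Integer vertex ids round-trip exactly through IEEE-754 doubles for any
# id up to 2**53, so storing them in a double-valued vector format is
# safe; the cost is purely bytes on disk: 8 per value instead of 4.

vertex_ids = [0, 1, 42, 2**31 - 1]

as_doubles = struct.pack(f"{len(vertex_ids)}d", *vertex_ids)  # 8 bytes each
as_int32s = struct.pack(f"{len(vertex_ids)}i", *vertex_ids)   # 4 bytes each

recovered = [int(x) for x in struct.unpack(f"{len(vertex_ids)}d", as_doubles)]
assert recovered == vertex_ids                 # no precision lost
assert len(as_doubles) == 2 * len(as_int32s)   # but twice the bytes

# Exactness holds right up to 2**53:
assert int(struct.unpack("d", struct.pack("d", float(2**53)))[0]) == 2**53
```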

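For context on the "min vertexes" step Neal describes: it is essentially
connected components by minimum-label propagation, where each vertex
repeatedly takes the smallest id seen among itself and its neighbors. A
minimal single-machine sketch (Python used as a stand-in; a real version
would be iterated Hadoop map/reduce passes, and none of these names are
Mahout APIs):

```python
# Min-label propagation for connected components: each loop iteration
# plays the role of one map/reduce pass (map: send your label to your
# neighbors; reduce: each vertex keeps the minimum label it received).

def propagate_min_labels(edges):
    """Label every vertex with the minimum vertex id in its component."""
    label = {}
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        label[u] = u  # start: every vertex is labeled with its own id
        label[v] = v

    changed = True
    while changed:  # iterate until no label shrinks, i.e. convergence
        changed = False
        for u, neighbors in adj.items():
            best = min([label[u]] + [label[n] for n in neighbors])
            if best < label[u]:
                label[u] = best
                changed = True
    return label

labels = propagate_min_labels([(1, 2), (2, 3), (7, 8)])
# vertices 1, 2, 3 share min id 1; vertices 7, 8 share min id 7
```

The number of passes is bounded by the graph diameter, which is why the
reduce-side load per vertex (and a possible two-phase refinement) matters
at scale.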