You actually have two graphs in your use case description :) The
first one is a bipartite graph consisting of a set of search queries on
the one hand and a set of web pages on the other hand.
From this graph you want to create another graph where all pages that
share a common query are connected. This is an algorithmic problem, not
just a way of formatting the input.
There are several ways to create this second graph. One way would be to
use Mahout's ItemSimilarityJob with (query,page) tuples as input.
ItemSimilarityJob will then give you either all similar pages or the
top-k most similar pages for each page, as (page,page) tuples. From this
output you could create the second graph very easily.
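To make the data flow concrete, here is a rough Python sketch of the two glue steps around the job. It assumes the job reads comma-separated lines of numeric IDs (here queryId,pageId, with the query in the "user" position) and writes tab-separated lines of the form pageIdA, pageIdB, similarity. Please check both formats against the Mahout version you use. The helper names encode_pairs and graph_from_similarities are made up for this example and are not part of Mahout.

```python
from collections import defaultdict

def encode_pairs(pairs):
    """Map (query, page) strings to numeric IDs and emit 'queryId,pageId' lines.

    The comma-separated numeric layout is an assumption about the job's input.
    """
    query_ids, page_ids = {}, {}
    lines = []
    for query, page in pairs:
        q = query_ids.setdefault(query, len(query_ids))
        p = page_ids.setdefault(page, len(page_ids))
        lines.append(f"{q},{p}")
    return lines, page_ids

def graph_from_similarities(lines, page_ids):
    """Build an undirected weighted adjacency map from similarity lines.

    Each line is assumed to look like 'pageIdA<TAB>pageIdB<TAB>similarity'.
    """
    names = {i: name for name, i in page_ids.items()}
    graph = defaultdict(dict)
    for line in lines:
        a, b, sim = line.split("\t")
        page_a, page_b = names[int(a)], names[int(b)]
        graph[page_a][page_b] = float(sim)
        graph[page_b][page_a] = float(sim)
    return dict(graph)
```

The similarity value becomes the edge weight here. If you need the raw number of shared queries instead, a co-occurrence style similarity measure would be the one to look at.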
Alternatively you could think about implementing a pairwise similarity
algorithm yourself in Giraph. You basically would need to find all pairs
of vertices that share a common neighbor.
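As a rough illustration of that idea, the following Python sketch simulates two message-passing rounds on the bipartite graph: pages announce themselves to their queries, then each query emits all pairs of pages it heard from. This is plain Python, not Giraph API code. The function name shared_neighbor_pairs and the dict-based input are assumptions made for the example, and the weight is the number of shared queries, which is only one possible choice.

```python
from collections import defaultdict
from itertools import combinations

def shared_neighbor_pairs(page_to_queries):
    """Count, for every pair of pages, how many queries they have in common."""
    # Round 0: each page vertex sends its own ID to every query it is linked to.
    inbox = defaultdict(set)
    for page, queries in page_to_queries.items():
        for query in queries:
            inbox[query].add(page)

    # Round 1: each query vertex emits every pair of pages it received.
    counts = defaultdict(int)
    for pages in inbox.values():
        for a, b in combinations(sorted(pages), 2):
            counts[(a, b)] += 1
    return dict(counts)
```

Note that a query which leads to n pages produces n*(n-1)/2 pairs, so very popular queries may need to be capped or sampled.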
On 08.05.2012 19:17, Raimon Bosch wrote:
> I'm designing a model to graph my web visits using the data in Access Log.
> My idea is to create edges between my pages through queries coming from
> Google i.e. if a user searches for "used cars in NY" and hits one of my
> pages (say A), and one month later another user searches for "used cars in
> NY" and hits another of my pages (say B) I can create an edge between A and
> B where the value of the edge will be the number of pages viewed for those
> 2 users.
> So my question is more directly related with the format used in Apache
> - Can I give values to the edges? i.e. A to B (cost is 6), and B to C (cost
> is 4).
> - In the shortestPath example we have an input like this:
> Can you give us an overview of what this graph would look like? That would
> be a nice document for the wiki page.
> - What would the input look like for my use case? (A -> B (cost 6), B -> C (cost
> Thanks in advance,
> Raimon Bosch.
> PS: Some feedback about my model would be appreciated too. I haven't found
> any papers about this topic yet.