Jake Mannix commented on GIRAPH-26:

Dmitriy, I was thinking along the lines of a less-rigorous version of the 
kroeneker product technique, as described in that paper.  Essentially you do 
the following, to find the adjacency matrix entry at row 1523, column 4918, 
take four "seed matrices" (most likely pseudorandom variations on a single 
fixed seed matrix), each of size 10x10 (could be any base, but it's easiest to 
consider decimal notation, and 10 is big enough to actually have a sensible 
small graph you're going to kroenker iterate on): 

A_{1523, 4918} = S_{1,4} * T_{5,4} * U_{2,1} * V_{3,8}

This requires no communication between the vertexes, and is totally fixed (ie 
deterministic) once you choose your seed matrices.

The trick is to figure out (fast!) which entries are nonzero, for any given 

A_{1523, *} = S_{1,*} * T_{5,*} * U_{2,*} * V_{3,*}

So at this point, you know you're dealing with row 1 of S, row 5 of T, row 2 of 
U, and row 3 of V.  Well, if each of these matrices is represented sparsely, 
and you can iterate quickly over *only* their nonzero entries (and there should 
be only a few in each row, as we're talking about a 10-vertex seed graph which 
is not too connected), and thus iterate over only the nonzero entries in the 
edge list for vertex 1523 very quickly as well: if row 1 of S has nonzero 
entries at points 1,4,9, row 5 of T has nonzeroes at 5,6, row 2 of U has 
nonzeroes only at 0, and row 3 of V has nonzeroes at 0,3,4, then we'll have a 
total of 3*2*1*3 = 18 nonzero edges for vertex 1523, and we can iterate over 
them very quickly.

I think this should make a pretty nice, relatively scale-free graph which has 
density parametrizable by initial 10x10 seed graph, with randomness introduced 
by how you want to create those S,T,U... variations on that seed.  And it 
should be embarrassingly parallel.

Anything I'm saying make no sense at all?

> Improve PseudoRandomVertexInputFormat to create a more realistic synthetic 
> graph (e.g. power-law distributed vertex-cardinality).
> ---------------------------------------------------------------------------------------------------------------------------------
>                 Key: GIRAPH-26
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-26
>             Project: Giraph
>          Issue Type: Test
>          Components: benchmark
>            Reporter: Jake Mannix
>            Priority: Minor
> The PageRankBenchmark class, to be a proper benchmark, should run over graphs 
> which look more like data seen in the wild, and web link graphs, social 
> network graphs, and text corpora (represented as a bipartite graph) all have 
> power-law distributions, so benchmarking a synthetic graph which looks more 
> like this would be a nice test which would stress cases of uneven 
> split-distribution and bottlenecks of subclusters of the graph of heavily 
> connected vertices.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to