# [jira] [Commented] (GIRAPH-26) Improve PseudoRandomVertexInputFormat to create a more realistic synthetic graph (e.g. power-law distributed vertex-cardinality).

```    [
https://issues.apache.org/jira/browse/GIRAPH-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099043#comment-13099043
] ```
```
Jake Mannix commented on GIRAPH-26:
-----------------------------------

Dmitriy, I was thinking along the lines of a less-rigorous version of the
kroeneker product technique, as described in that paper.  Essentially you do
the following, to find the adjacency matrix entry at row 1523, column 4918,
take four "seed matrices" (most likely pseudorandom variations on a single
fixed seed matrix), each of size 10x10 (could be any base, but it's easiest to
consider decimal notation, and 10 is big enough to actually have a sensible
small graph you're going to kroenker iterate on):

A_{1523, 4918} = S_{1,4} * T_{5,4} * U_{2,1} * V_{3,8}

This requires no communication between the vertexes, and is totally fixed (ie
deterministic) once you choose your seed matrices.

The trick is to figure out (fast!) which entries are nonzero, for any given
vertex:

A_{1523, *} = S_{1,*} * T_{5,*} * U_{2,*} * V_{3,*}

So at this point, you know you're dealing with row 1 of S, row 5 of T, row 2 of
U, and row 3 of V.  Well, if each of these matrices is represented sparsely,
and you can iterate quickly over *only* their nonzero entries (and there should
be only a few in each row, as we're talking about a 10-vertex seed graph which
is not too connected), and thus iterate over only the nonzero entries in the
edge list for vertex 1523 very quickly as well: if row 1 of S has nonzero
entries at points 1,4,9, row 5 of T has nonzeroes at 5,6, row 2 of U has
nonzeroes only at 0, and row 3 of V has nonzeroes at 0,3,4, then we'll have a
total of 3*2*1*3 = 18 nonzero edges for vertex 1523, and we can iterate over
them very quickly.

I think this should make a pretty nice, relatively scale-free graph which has
density parametrizable by initial 10x10 seed graph, with randomness introduced
by how you want to create those S,T,U... variations on that seed.  And it
should be embarrassingly parallel.

Anything I'm saying make no sense at all?

> Improve PseudoRandomVertexInputFormat to create a more realistic synthetic
> graph (e.g. power-law distributed vertex-cardinality).
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-26
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-26
>             Project: Giraph
>          Issue Type: Test
>          Components: benchmark
>            Reporter: Jake Mannix
>            Priority: Minor
>
> The PageRankBenchmark class, to be a proper benchmark, should run over graphs
> which look more like data seen in the wild, and web link graphs, social
> network graphs, and text corpora (represented as a bipartite graph) all have
> power-law distributions, so benchmarking a synthetic graph which looks more
> like this would be a nice test which would stress cases of uneven
> split-distribution and bottlenecks of subclusters of the graph of heavily
> connected vertices.

--
This message is automatically generated by JIRA.

```