Unfortunately, someone (probably me) needs to write a wiki page on this
issue. Currently, we require that your vertices be globally sorted by
vertex id and that the vertices read in each input split be in order by
vertex id. That probably explains the weirdness you are seeing. This
issue is being addressed (albeit slowly, because of a new job) in
https://issues.apache.org/jira/browse/GIRAPH-11, where it is also
described a bit more fully.
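
To make the ordering requirement concrete, here is a minimal Java sketch
that checks a set of input splits against it. This is not Giraph code: the
class and method names are hypothetical, and it assumes that "globally
sorted" means every id in one split is smaller than every id in the next
split, which is one reading of the constraint rather than something taken
from the Giraph source. It compares a mod-style split (as in the question
below) with a contiguous range split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Giraph: checks the ordering constraint
// described above on in-memory lists of vertex ids.
public class SplitOrderCheck {

    /** True if the ids within one split are strictly increasing. */
    public static boolean isSortedWithinSplit(List<Long> ids) {
        for (int i = 1; i < ids.size(); i++) {
            if (ids.get(i - 1) >= ids.get(i)) {
                return false;
            }
        }
        return true;
    }

    /**
     * True if every split is sorted and (assumed meaning of "globally
     * sorted") each split starts above the largest id of the previous one.
     */
    public static boolean isGloballySorted(List<List<Long>> splits) {
        boolean seenAny = false;
        long previousMax = Long.MIN_VALUE;
        for (List<Long> split : splits) {
            if (!isSortedWithinSplit(split)) {
                return false;
            }
            if (split.isEmpty()) {
                continue;
            }
            if (seenAny && split.get(0) <= previousMax) {
                return false;
            }
            previousMax = split.get(split.size() - 1);
            seenAny = true;
        }
        return true;
    }

    /** Split r holds the ids with id % numSplits == r, in increasing order. */
    public static List<List<Long>> splitByModulo(long numVertices, int numSplits) {
        List<List<Long>> splits = new ArrayList<>();
        for (int r = 0; r < numSplits; r++) {
            List<Long> split = new ArrayList<>();
            for (long id = r; id < numVertices; id += numSplits) {
                split.add(id);
            }
            splits.add(split);
        }
        return splits;
    }

    /** Split k holds one contiguous block of ids, in increasing order. */
    public static List<List<Long>> splitByRange(long numVertices, int numSplits) {
        long chunk = (numVertices + numSplits - 1) / numSplits; // ceiling division
        List<List<Long>> splits = new ArrayList<>();
        for (int k = 0; k < numSplits; k++) {
            List<Long> split = new ArrayList<>();
            long end = Math.min((k + 1) * chunk, numVertices);
            for (long id = k * chunk; id < end; id++) {
                split.add(id);
            }
            splits.add(split);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Each mod split is sorted internally, but the splits interleave.
        System.out.println("modulo: " + isGloballySorted(splitByModulo(9, 3)));
        // Contiguous ranges satisfy both conditions.
        System.out.println("range:  " + isGloballySorted(splitByRange(9, 3)));
    }
}
```

With 9 vertices and 3 splits, the mod split produces [0, 3, 6], [1, 4, 7],
[2, 5, 8]: each split is sorted, but the second split starts below the
maximum of the first, so the check fails. The range split produces
[0, 1, 2], [3, 4, 5], [6, 7, 8] and passes.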
Avery
On 10/1/11 12:44 PM, Aapo Kyrola wrote:
Hi,
I have a problem that is very difficult to debug. Several vertices seem
to be duplicated. Maybe I am not reading the inputs properly? Here is
more info:
- I have three input splits and use three workers. I have written my
own input format (part of the zip I sent a few days ago). In split one
I have the ids with id mod 3 = 0, in split two the ids with id mod 3 = 1,
etc.
I added some extra debug output for vertex id 875600:
- I checked that vertex 875600 is read only once, with 8 edges, by
adding a System.out.println debug statement:
::: READ: 875600 ; 8 : [81066, 271870, 272882, 483962, 621946, 723717,
834555, 845506]
- In vertex.compute I print the hostname of the machine and the number
of messages and edges. From this I see that the vertex appears on two
different hosts, because I get two kinds of output:
hostA.ml.cmu.edu 875600* => 0.0 / 0.0 msgs=0/6813839/8
hostB.ml.cmu.edu 875600* => -3.4657359027997265 / -3.4657359027997265 msgs=5/6813839/0
Note that the last field of the debug output is
num-of-messages/num-edges/num-out-edges.
On hostB this vertex has no edges, but on hostA it has the
correct 8 edges.
--
Does it matter how I split the vertex-ids?
P.S. For the next report I will make an Apache account. Too busy now.
Aapo Kyrola
Ph.D. student, http://www.cs.cmu.edu/~akyrola