I have a few comments about the current design of Giraph regarding the
implicit creation of vertices.
As it's currently designed, if you send a message to a non-existent
vertices, Giraph creates it for you.
Although I can understand it can get handy as it allows for lazy
dataset creation, I think it comes at some cost and I believe this
cost is bigger than the advantage:
1) it overlaps the mutation API, where a vertex can be created
explicitly when the semantics of the algorithm require it, with
knowledge about what's going on and with explicit state. This is an
ambiguous and unclear part of the API which is difficult for me to
justify and probably confusing for the user too. Which brings me to
the second point.
2) it requires a different, and partially duplicate,code path for
mutations and implicit vertex creation in our code, as it's clear by
looking at BasicRPCCommunication and as it's been experienced
currently by me in the email I recently sent to the list. Which brings
me to the third point.
3) in order to manage this, for every message we have to hit, sooner
or later, the Worker vertices set to see if the vertex is existing and
whether it should be implicitly created. This is computationally
expensive both if you have a HashMap but also if you have a TreeMap
for range partitioning. Also, if we're going to create more exotic
partitioning (topology-partitioning?), we're going to hit the problem
In general, I don't know any graph API that doesn't require to either
list explicitly the vertex set at load or to create the vertex
explicitly through API. As I said, I understand it allows for lazy
creation of the input file, with possibly missing vertices explicitly
enlisted (missing as a source vertex but existing as an endpoint for
an edge), but this could be really fixed robustly by a single
What do you guys think?