Re: Dynamic vertices and hama counters
Sorry, my bad. I only focused on the counter part and didn't pay attention to the vertex-related issue. I thought the goal was just to share a counter value between peers. In that case, persisting the counter value to ZK shouldn't be a problem and won't incur overhead. But if the case is not about counters, please just ignore my previous post.

On 17 July 2013 06:59, Edward J. Yoon edwardy...@apache.org wrote:
> You guys seem to have totally misunderstood what I am saying.
>
> Does every BSP processor access ZK's counter concurrently? Do you think it is possible to determine the current total number of vertices in every step without barrier synchronization?
>
> As I mentioned before, there are already additional barrier synchronization steps for aggregating and broadcasting the global updated vertex count. You can use these steps with *no additional barrier synchronization*.
>
> On Wed, Jul 17, 2013 at 5:01 AM, andronat_asf andronat_...@hotmail.com wrote:
> > Thank you everyone,
> >
> > +1 for Tommaso, I will see what I can do about that :)
> >
> > I also believe that ZK is very similar to the sync() mechanism that Edward is describing, but if we need to sync more info we might need ZK.
> >
> > Thanks again,
> > Anastasis
> >
> > On 15 Jul 2013, at 5:55 PM, Edward J. Yoon edwardy...@apache.org wrote:
> > > andronat_asf,
> > >
> > > To aggregate and broadcast the global count of updated vertices, we call sync() twice. See the doAggregationUpdates() method in GraphJobRunner. You can solve your problem the same way, and there will be no additional cost.
> > >
> > > Using ZooKeeper is not a bad idea, but IMO it's not much different from the sync() mechanism.
> > >
> > > --
> > > Best Regards, Edward J. Yoon
> > > @eddieyoon
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
Re: Dynamic vertices and hama counters
What about introducing a proper API for counting vertices: something like an interface VertexCounter with 2-3 implementations, such as InMemoryVertexCounter (basically the current one), a DistributedVertexCounter to implement the scenario where we use a separate BSP superstep to count them, and a ZKVertexCounter which handles vertex counts as per Chia-Hung's suggestion.

We may also introduce something like a configuration variable to define whether all the vertices are needed or just the neighbors (and/or some other strategy).

My 2 cents,
Tommaso

2013/7/14 Chia-Hung Lin cli...@googlemail.com
> Just my personal viewpoint. For a small amount of global information, storing the state in ZooKeeper might be a reasonable solution.
>
> On 13 July 2013 21:28, andronat_asf andronat_...@hotmail.com wrote:
> > Hello everyone,
> >
> > I'm working on HAMA-767 and I have some concerns about counters and scalability. Currently, every peer has a set of vertices and a variable that keeps the total number of vertices across all peers. In my case, I'm trying to add and remove vertices during the runtime of a job, which means that I have to update all those variables.
> >
> > My problem is that this is not efficient, because for every operation (adding or removing a vertex) I need to update all peers, so I need to send lots of messages to make those updates (see the GraphJobRunner#countGlobalVertexCount method), and I believe this is neither correct nor scalable. Another problem is that, even if I update all those variables (at the cost of sending lots of messages to every peer), those variables will only be updated on the next superstep.
> >
> > e.g.:
> >
> > Peer 1:            Peer 2:
> > Vert_1             Vert_2
> > (Total_V = 2)      (Total_V = 2)
> > addVertex()
> > (Total_V = 3)
> >                    getNumberOfV() = 2
> > ------------- Sync -------------
> >                    getNumberOfV() = 3
> >
> > Is there something like global counters or shared memory that can address this issue?
> >
> > P.S. I have a small feeling that we don't need to track the total number of vertices, because vertex-centered algorithms rarely need total numbers; they only depend on neighbors (I might be wrong though).
> >
> > Thanks,
> > Anastasis
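
For readers following the thread, the VertexCounter proposal in the message above could look roughly like the sketch below. This is only an illustration under assumptions: the method names (vertexAdded, vertexRemoved, getNumberOfVertices), the create() factory, and the "memory" strategy string are hypothetical and are not existing Hama APIs. Only the in-memory variant is shown; the distributed and ZooKeeper-backed variants mentioned in the thread would implement the same interface.

```java
import java.util.concurrent.atomic.AtomicLong;

final class VertexCounterSketch {

    /** Hypothetical interface named after the proposal in this thread; not a Hama API. */
    interface VertexCounter {
        void vertexAdded();
        void vertexRemoved();
        long getNumberOfVertices();
    }

    /** Keeps the count in local memory only, so other peers never see updates. */
    static final class InMemoryVertexCounter implements VertexCounter {
        private final AtomicLong count;

        InMemoryVertexCounter(long initialCount) {
            this.count = new AtomicLong(initialCount);
        }

        @Override public void vertexAdded() { count.incrementAndGet(); }
        @Override public void vertexRemoved() { count.decrementAndGet(); }
        @Override public long getNumberOfVertices() { return count.get(); }
    }

    /**
     * Picks an implementation from a strategy string, standing in for the
     * "configuration variable" idea. The strategy names are assumptions.
     */
    static VertexCounter create(String strategy, long initialCount) {
        if ("memory".equals(strategy)) {
            return new InMemoryVertexCounter(initialCount);
        }
        throw new IllegalArgumentException("unsupported strategy: " + strategy);
    }

    public static void main(String[] args) {
        VertexCounter counter = create("memory", 2);
        counter.vertexAdded();
        System.out.println(counter.getNumberOfVertices()); // prints 3
    }
}
```

Keeping the counter behind an interface means algorithms that only look at neighbors would not pay for a global count, which is the point Chia-Hung raises elsewhere in the thread.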
Re: Dynamic vertices and hama counters
+1 for Tommaso's solution. If not every algorithm needs a counter service, having an interface with different implementations (in-memory, zk, etc.) should reduce the side effects.

On 15 July 2013 15:51, Tommaso Teofili tommaso.teof...@gmail.com wrote:
> What about introducing a proper API for counting vertices: something like an interface VertexCounter with 2-3 implementations, such as InMemoryVertexCounter (basically the current one), a DistributedVertexCounter to implement the scenario where we use a separate BSP superstep to count them, and a ZKVertexCounter which handles vertex counts as per Chia-Hung's suggestion.
>
> We may also introduce something like a configuration variable to define whether all the vertices are needed or just the neighbors (and/or some other strategy).
>
> My 2 cents,
> Tommaso
Re: Dynamic vertices and hama counters
andronat_asf,

To aggregate and broadcast the global count of updated vertices, we call sync() twice. See the doAggregationUpdates() method in GraphJobRunner. You can solve your problem the same way, and there will be no additional cost.

Using ZooKeeper is not a bad idea, but IMO it's not much different from the sync() mechanism.

On Mon, Jul 15, 2013 at 10:05 PM, Chia-Hung Lin cli...@googlemail.com wrote:
> +1 for Tommaso's solution. If not every algorithm needs a counter service, having an interface with different implementations (in-memory, zk, etc.) should reduce the side effects.

--
Best Regards, Edward J. Yoon
@eddieyoon
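
To make the sync()-based idea in the message above more concrete, here is a small self-contained simulation. It does not use Hama and is not how doAggregationUpdates() is actually implemented; SimulatedPeer and the sync() helper are hypothetical stand-ins. The assumption illustrated is that each peer accumulates a local delta (adds minus removes) and sends one delta per superstep at the barrier, instead of one message per add or remove.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SuperstepCountSketch {

    /** Stand-in for a BSP peer; this is not Hama's BSPPeer. */
    static final class SimulatedPeer {
        private long globalCount;          // value as of the last barrier
        private long localDelta;           // adds minus removes since the last barrier
        private final List<Long> inbox = new ArrayList<>();

        SimulatedPeer(long initialGlobalCount) {
            this.globalCount = initialGlobalCount;
        }

        void addVertex() { localDelta++; }
        void removeVertex() { localDelta--; }

        /** Returns the count agreed on at the last barrier, ignoring pending deltas. */
        long getNumberOfVertices() { return globalCount; }

        long takeDelta() {
            long delta = localDelta;
            localDelta = 0;
            return delta;
        }

        void receive(long delta) { inbox.add(delta); }

        void applyInbox() {
            for (long delta : inbox) {
                globalCount += delta;
            }
            inbox.clear();
        }
    }

    /** Simulated barrier: each peer broadcasts one delta, then every peer applies all deltas. */
    static void sync(List<SimulatedPeer> peers) {
        for (SimulatedPeer sender : peers) {
            long delta = sender.takeDelta();
            for (SimulatedPeer receiver : peers) {
                receiver.receive(delta);
            }
        }
        for (SimulatedPeer peer : peers) {
            peer.applyInbox();
        }
    }

    public static void main(String[] args) {
        SimulatedPeer peer1 = new SimulatedPeer(2);
        SimulatedPeer peer2 = new SimulatedPeer(2);
        List<SimulatedPeer> peers = Arrays.asList(peer1, peer2);

        peer1.addVertex();
        System.out.println(peer2.getNumberOfVertices()); // prints 2 (not yet synchronized)

        sync(peers);
        System.out.println(peer2.getNumberOfVertices()); // prints 3
    }
}
```

The count is still only refreshed at the barrier, which matches the Peer 1 / Peer 2 example from the original question; what changes is that the message volume depends on the number of peers rather than on the number of add or remove operations.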