[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103422#comment-13103422 ]
Avery Ching commented on GIRAPH-12:
-----------------------------------

Hyunsik, just to update: I grabbed your patch and it passed the unit tests on my machine. Then I ran it on a cluster at Yahoo!. I didn't have time to write a messaging benchmark, so I ran PageRankBenchmark with 100 workers, 1 M vertices, 3 supersteps, and 10 edges per vertex.

Here are 2 runs with the original code:

11/09/13 07:02:08 INFO mapred.JobClient: Giraph Timers
11/09/13 07:02:08 INFO mapred.JobClient: Total (milliseconds)=46709
11/09/13 07:02:08 INFO mapred.JobClient: Superstep 3 (milliseconds)=1682
11/09/13 07:02:08 INFO mapred.JobClient: Setup (milliseconds)=3228
11/09/13 07:02:08 INFO mapred.JobClient: Shutdown (milliseconds)=1223
11/09/13 07:02:08 INFO mapred.JobClient: Vertex input superstep (milliseconds)=3578
11/09/13 07:02:08 INFO mapred.JobClient: Superstep 0 (milliseconds)=16222
11/09/13 07:02:08 INFO mapred.JobClient: Superstep 2 (milliseconds)=12302
11/09/13 07:02:08 INFO mapred.JobClient: Superstep 1 (milliseconds)=8467

11/09/13 07:14:51 INFO mapred.JobClient: Giraph Timers
11/09/13 07:14:51 INFO mapred.JobClient: Total (milliseconds)=51475
11/09/13 07:14:51 INFO mapred.JobClient: Superstep 3 (milliseconds)=1348
11/09/13 07:14:51 INFO mapred.JobClient: Setup (milliseconds)=7233
11/09/13 07:14:51 INFO mapred.JobClient: Shutdown (milliseconds)=884
11/09/13 07:14:51 INFO mapred.JobClient: Vertex input superstep (milliseconds)=3284
11/09/13 07:14:51 INFO mapred.JobClient: Superstep 0 (milliseconds)=22213
11/09/13 07:14:51 INFO mapred.JobClient: Superstep 2 (milliseconds)=8553
11/09/13 07:14:51 INFO mapred.JobClient: Superstep 1 (milliseconds)=7955

Here are 2 runs with your code:

11/09/13 07:06:56 INFO mapred.JobClient: Giraph Timers
11/09/13 07:06:56 INFO mapred.JobClient: Total (milliseconds)=51935
11/09/13 07:06:56 INFO mapred.JobClient: Superstep 3 (milliseconds)=1150
11/09/13 07:06:56 INFO mapred.JobClient: Setup (milliseconds)=3338
11/09/13 07:06:56 INFO mapred.JobClient: Shutdown (milliseconds)=833
11/09/13 07:06:56 INFO mapred.JobClient: Vertex input superstep (milliseconds)=3401
11/09/13 07:06:56 INFO mapred.JobClient: Superstep 0 (milliseconds)=17297
11/09/13 07:06:56 INFO mapred.JobClient: Superstep 2 (milliseconds)=14384
11/09/13 07:06:56 INFO mapred.JobClient: Superstep 1 (milliseconds)=11528

11/09/13 07:12:09 INFO mapred.JobClient: Giraph Timers
11/09/13 07:12:09 INFO mapred.JobClient: Total (milliseconds)=51985
11/09/13 07:12:09 INFO mapred.JobClient: Superstep 3 (milliseconds)=1362
11/09/13 07:12:09 INFO mapred.JobClient: Setup (milliseconds)=3776
11/09/13 07:12:09 INFO mapred.JobClient: Shutdown (milliseconds)=710
11/09/13 07:12:09 INFO mapred.JobClient: Vertex input superstep (milliseconds)=3771
11/09/13 07:12:09 INFO mapred.JobClient: Superstep 0 (milliseconds)=17741
11/09/13 07:12:09 INFO mapred.JobClient: Superstep 2 (milliseconds)=13068
11/09/13 07:12:09 INFO mapred.JobClient: Superstep 1 (milliseconds)=11551

In my limited testing, the numbers aren't too different. I also see that the connections are maintained throughout the application run, as you mentioned. So the only tradeoff is possibly the reduced parallelization of message sending (a user-chosen number of threads vs. all threads). I like the approach and think it's an improvement (controllable threads). Perhaps my only comment regards the following code block:

for (PeerConnection pc : peerConnections.values()) {
  futures.add(executor.submit(new PeerFlushExecutor(pc)));
}

It would probably be good to randomize the PeerConnection objects to avoid hotspots on the receiving side?
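A minimal sketch of that randomization suggestion: copy the connections into a list and shuffle it before submitting the flush tasks, so each worker contacts its peers in a different order. The PeerConnection and PeerFlushExecutor classes below are simplified stand-ins, not the real Giraph implementations.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RandomizedFlush {
  // Simplified stand-in for Giraph's PeerConnection.
  static class PeerConnection {
    final String peer;
    PeerConnection(String peer) { this.peer = peer; }
    void flush() { /* send buffered messages to this peer */ }
  }

  // Simplified stand-in for PeerFlushExecutor: flushes one connection.
  static class PeerFlushExecutor implements Runnable {
    private final PeerConnection pc;
    PeerFlushExecutor(PeerConnection pc) { this.pc = pc; }
    public void run() { pc.flush(); }
  }

  // Shuffle the connections before submitting flush tasks, so that
  // workers flush their peers in different orders and no single
  // receiver is hammered by everyone at once.
  static List<Future<?>> flushAll(ExecutorService executor,
                                  Collection<PeerConnection> conns) {
    List<PeerConnection> shuffled = new ArrayList<>(conns);
    Collections.shuffle(shuffled);
    List<Future<?>> futures = new ArrayList<>();
    for (PeerConnection pc : shuffled) {
      futures.add(executor.submit(new PeerFlushExecutor(pc)));
    }
    return futures;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    List<PeerConnection> conns = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      conns.add(new PeerConnection("worker-" + i));
    }
    // Wait for all flushes to complete, as the flush barrier would.
    for (Future<?> f : flushAll(executor, conns)) {
      f.get();
    }
    executor.shutdown();
    System.out.println("flushed " + conns.size() + " peers");
  }
}
```

Since the shuffle happens independently on every worker each superstep, the arrival order at any given receiver varies run to run, which is what spreads the load.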
> Investigate communication improvements
> --------------------------------------
>
>          Key: GIRAPH-12
>          URL: https://issues.apache.org/jira/browse/GIRAPH-12
>      Project: Giraph
>   Issue Type: Improvement
>   Components: bsp
>     Reporter: Avery Ching
>     Assignee: Hyunsik Choi
>     Priority: Minor
>  Attachments: GIRAPH-12_1.patch
>
> Currently every worker will start up a thread to communicate with every other
> worker. Hadoop RPC is used for communication. For instance, if there are
> 400 workers, each worker will create 400 threads. This ends up using a lot
> of memory, even with the option -Dmapred.child.java.opts="-Xss64k".
> It would be good to investigate using frameworks like Netty, or to roll
> our own, to improve this situation. By moving away from Hadoop RPC, we would
> also make compatibility with different Hadoop versions easier.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira