Re: Please welcome our newest committer and PMC member, Eugene!
Congrats and welcome Eugene! I'm looking forward to your contributions.

-- Hyunsik Choi

On Wed, May 2, 2012 at 5:39 AM, Jakob Homan jgho...@gmail.com wrote:

I'm happy to announce that the Giraph PMC has voted Eugene Koontz in as a committer and PMC member. Eugene has been pitching in with great patches that have been very useful, such as helping us sort out our terrifying munging situation (GIRAPH-168). Welcome aboard, Eugene!

-Jakob
[jira] [Assigned] (GIRAPH-174) ConnectedComponentsVertex for loops can be replaced with for-each loops
[ https://issues.apache.org/jira/browse/GIRAPH-174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyunsik Choi reassigned GIRAPH-174:
Assignee: Roman K

ConnectedComponentsVertex for loops can be replaced with for-each loops
Key: GIRAPH-174
URL: https://issues.apache.org/jira/browse/GIRAPH-174
Project: Giraph
Issue Type: Improvement
Reporter: Jakob Homan
Assignee: Roman K
Priority: Trivial
Labels: newbie
Attachments: GIRAPH-174.patch

{code}
// First superstep is special, because we can simply look at the neighbors
if (getSuperstep() == 0) {
  for (Iterator<IntWritable> edges = iterator(); edges.hasNext();) {
    int neighbor = edges.next().get();
    if (neighbor < currentComponent) {
      currentComponent = neighbor;
    }
  }
  // Only need to send value if it is not the own id
  if (currentComponent != getVertexValue().get()) {
    setVertexValue(new IntWritable(currentComponent));
    for (Iterator<IntWritable> edges = iterator(); edges.hasNext();) {
      int neighbor = edges.next().get();
      if (neighbor > currentComponent) {
        sendMsg(new IntWritable(neighbor), getVertexValue());
      }
    }
  }
{code}

Both of the for loops in this chunk from ConnectedComponentsVertex can be replaced with for(IntWritable i : iterator()) loops to be more idiomatic.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
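For anyone picking up GIRAPH-174, here is a minimal sketch of the two loop styles side by side. One thing to watch: Java's for-each loop needs an Iterable on the right-hand side, not an Iterator, so the loop has to run over something that implements Iterable. The NeighborList class below is a hypothetical stand-in for the vertex, and plain Integer replaces IntWritable so the sketch has no Hadoop dependency; none of these names come from the Giraph code base.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a vertex: it only holds neighbor ids.
// Plain Integer replaces IntWritable so the sketch needs no Hadoop classes.
final class NeighborList implements Iterable<Integer> {
  private final List<Integer> neighbors;

  NeighborList(Integer... ids) {
    this.neighbors = Arrays.asList(ids);
  }

  @Override
  public Iterator<Integer> iterator() {
    return neighbors.iterator();
  }

  // Explicit-iterator style, as in the chunk quoted in the issue.
  static int minWithIterator(NeighborList vertex, int ownId) {
    int currentComponent = ownId;
    for (Iterator<Integer> edges = vertex.iterator(); edges.hasNext();) {
      int neighbor = edges.next();
      if (neighbor < currentComponent) {
        currentComponent = neighbor;
      }
    }
    return currentComponent;
  }

  // For-each style: the loop runs over the Iterable itself,
  // not over the Iterator returned by iterator().
  static int minWithForEach(NeighborList vertex, int ownId) {
    int currentComponent = ownId;
    for (int neighbor : vertex) {
      if (neighbor < currentComponent) {
        currentComponent = neighbor;
      }
    }
    return currentComponent;
  }
}
```

Both methods compute the same smallest id; the second one just drops the iterator bookkeeping.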
[jira] [Commented] (GIRAPH-185) Improve concurrency of putMsg / putMsgList
[ https://issues.apache.org/jira/browse/GIRAPH-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262278#comment-13262278 ]

Hyunsik Choi commented on GIRAPH-185:

If there is a trade-off between performance and memory consumption, memory consumption seems more important in the current Giraph implementation. Also, I agree that some benchmarks are necessary.

Improve concurrency of putMsg / putMsgList
Key: GIRAPH-185
URL: https://issues.apache.org/jira/browse/GIRAPH-185
Project: Giraph
Issue Type: Improvement
Components: graph
Affects Versions: 0.2.0
Reporter: Bo Wang
Assignee: Bo Wang
Fix For: 0.2.0
Attachments: GIRAPH-185.patch, GIRAPH-185.patch
Original Estimate: 2h
Remaining Estimate: 2h

Currently in putMsg / putMsgList, a synchronized block is used to protect the whole transientInMessages when adding a new message. This lock blocks other concurrent calls to putMsg/putMsgList and increases the response time. We should use fine-grained locks to allow high concurrency in message communication.
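To make the fine-grained locking idea concrete, here is a rough sketch, not the attached patch. It assumes a map from vertex id to message list and locks only the list of the destination vertex. InMessageStore, putMsg as written here, count and stress are all hypothetical names, and String stands in for the message type. Note that the concurrent map and per-vertex lists may cost more memory than a single synchronized structure, which is the trade-off raised in the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical message store; names are illustrative, not Giraph's classes.
final class InMessageStore {
  private final ConcurrentMap<Long, List<String>> messages =
      new ConcurrentHashMap<Long, List<String>>();

  // Locks only the list of the destination vertex, so callers writing to
  // different vertices do not block each other.
  void putMsg(long vertexId, String msg) {
    List<String> list = messages.get(vertexId);
    if (list == null) {
      List<String> fresh = new ArrayList<String>();
      list = messages.putIfAbsent(vertexId, fresh);
      if (list == null) {
        list = fresh; // this thread won the race to create the list
      }
    }
    synchronized (list) {
      list.add(msg);
    }
  }

  int count(long vertexId) {
    List<String> list = messages.get(vertexId);
    if (list == null) {
      return 0;
    }
    synchronized (list) {
      return list.size();
    }
  }

  // Small self-check: several threads write to the same two vertices
  // and the total number of stored messages is returned.
  static int stress(int threads, final int perThread) {
    final InMessageStore store = new InMessageStore();
    List<Thread> workers = new ArrayList<Thread>();
    for (int t = 0; t < threads; t++) {
      Thread worker = new Thread(new Runnable() {
        @Override
        public void run() {
          for (int i = 0; i < perThread; i++) {
            store.putMsg(i % 2, "m");
          }
        }
      });
      workers.add(worker);
      worker.start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return store.count(0) + store.count(1);
  }
}
```

Whether this is actually faster than one coarse lock would need the benchmarks mentioned in the comment.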
Re: hadoop version profiles
This approach looks good. +1

On Wed, Mar 21, 2012 at 3:53 PM, Avery Ching ach...@apache.org wrote:

I agree with this approach, although munge is kinda hacky. It is easy though. =)

Avery

On 3/20/12 5:52 PM, Eugene Koontz wrote:

Hi Giraphers,

I think it might be good to look at how we can add support for new hadoop versions. Currently we have hadoop_facebook (https://issues.apache.org/jira/browse/GIRAPH-14). I am considering adding new ones such as hadoop_0.24. Looking at the code, it seems that the main hadoop variation between the stock hadoop used (0.20.203.0) and facebook has to do with the new security-related APIs in the latter, which are, fortunately, also available in hadoop 0.23 and 0.24. So, hopefully we can make use of the existing work that Avery has done for hadoop_facebook and apply it to other hadoop versions.

Therefore I would propose that:

1. we add a new munge flag HADOOP_SECURE, to be used in RPCCommunication.java and a few other places where we are currently checking for HADOOP_FACEBOOK and HADOOP.
2. we make a new profile called hadoop_secure, which, as with hadoop_facebook, will use the above munge flag.
3. we make a new profile hadoop_0.20.203 for the existing default hadoop and make it the default profile (activeByDefault=true). This will make it easier to handle the differences in the hadoop library dependency set that have happened between 0.20.203 and hadoop trunk.

Please see https://github.com/ekoontz/giraph/tree/security-profile for my branch that implements the above.

Thanks,
-Eugene
Re: [VOTE] Release Giraph 0.1-incubating (rc0)
I also checked that it compiles, that all tests pass, and that the gpg signature and md5 checksum verify. +1 for both the source release and the binary tarball release.

-- Hyunsik Choi

On Wed, Feb 1, 2012 at 8:36 AM, Jakob Homan jgho...@gmail.com wrote:

Giraphers-

I've created a candidate for our first release. It's a source release without a binary for two reasons: first, there's still discussion going on about what needs to be done for the NOTICE and LICENSE files for projects that bring in transitive dependencies to the binary release (http://www.mail-archive.com/general@incubator.apache.org/msg32693.html) and second because we're still munging our binary against three types of Hadoop, which would mean we'd need to release three different binary artifacts, which seems suboptimal. Hopefully both of these issues will be addressed by 0.2.

I've tested the release against an unsecure 20.2 cluster. It'd be great to test it against other configurations.

Note that we're voting on the tag; the files are provided as a convenience.

Release notes: http://people.apache.org/~jghoman/giraph-0.1.0-incubating-rc0/RELEASE_NOTES.html
Release artifacts: http://people.apache.org/~jghoman/giraph-0.1.0-incubating-rc0/
Corresponding svn tag: http://svn.apache.org/repos/asf/incubator/giraph/tags/release-0.1-rc0/
Our signing keys (my key doesn't seem to be being picked up by http://people.apache.org/keys/group/giraph.asc): http://svn.apache.org/repos/asf/incubator/giraph/KEYS

The vote runs for 72 hours, until Friday 4pm PST. After a successful vote here, Incubator will vote on the release as well.

Thanks,
Jakob
Re: Time to roll a release?
+1

-- Hyunsik Choi

On Thu, Jan 5, 2012 at 8:57 PM, Sebastian Schelter s...@apache.org wrote:

+1 from me too, as Jake already said: release early, release often.

On 04.01.2012 23:07, Mattmann, Chris A (388J) wrote:

Super +1, thanks for pushing this Jakob.

Cheers,
Chris

On Jan 4, 2012, at 3:15 PM, Jakob Homan wrote:

I think there's been enough work done since Giraph entered incubation that we're ready to do a release. We've had significant performance and usability improvements, to the point where anyone interested in Giraph/Pregel/BSP should definitely take a look at the code and try it out. Rolling a release would signal anyone left on the fence that it's worth their time. This is also a required criterion for advancing through the incubator, as we're doing well on the others currently. Having been peripherally involved in Kafka's recent first release, I can tell you it's quite a lot of paperwork, but I'm happy to volunteer to roll the first one. Any objections? Ideas? Hysterical laughter?

-Jakob

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
[jira] [Commented] (GIRAPH-45) Improve the way to keep outgoing messages
[ https://issues.apache.org/jira/browse/GIRAPH-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158131#comment-13158131 ]

Hyunsik Choi commented on GIRAPH-45:

Claudio, thanks for the nice suggestion. That seems like a cool idea. However, I am concerned about the platform dependency of leveldb. leveldb is written in C++, which may make Giraph harder to distribute. What does everyone else think?

Improve the way to keep outgoing messages
Key: GIRAPH-45
URL: https://issues.apache.org/jira/browse/GIRAPH-45
Project: Giraph
Issue Type: Improvement
Components: bsp
Reporter: Hyunsik Choi
Assignee: Hyunsik Choi

As discussed in GIRAPH-12 (http://goo.gl/CE32U), I think there is a potential out-of-memory problem when the rate of message generation is higher than the rate of message flush (or network bandwidth). To overcome this problem, we need a more eager strategy for message flushing or some approach to spill messages to disk. The link below is Dmitriy's suggestion.

https://issues.apache.org/jira/browse/GIRAPH-12?focusedCommentId=13116253&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13116253
Re: [jira] [Resolved] (GIRAPH-98) Add Claudio Martella to site
An option '-e ssh' may be needed. I've updated the wiki page. Congrats on joining the Giraph PPMC!

-- Hyunsik Choi

On Sat, Nov 19, 2011 at 9:22 PM, Claudio Martella (Resolved) (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/GIRAPH-98?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Claudio Martella resolved GIRAPH-98.
Resolution: Fixed
Fix Version/s: 0.70.0

I followed the instructions on the wiki and the rsync kept hanging on me after a while, I don't know why. So what I did was rsyncing towards ~/giraph instead of /www/incubator.apache.org/giraph/ and then cp -r * stuff from my ~/giraph (on people.apache.org) to it. It looks like it worked out, but I'm curious to understand why it hung.

Add Claudio Martella to site
Key: GIRAPH-98
URL: https://issues.apache.org/jira/browse/GIRAPH-98
Project: Giraph
Issue Type: Task
Reporter: Claudio Martella
Assignee: Claudio Martella
Fix For: 0.70.0
Attachments: GIRAPH-98.diff
[jira] [Updated] (GIRAPH-68) Implement a Graph Generator
[ https://issues.apache.org/jira/browse/GIRAPH-68?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyunsik Choi updated GIRAPH-68:
Attachment: GIRAPH-68_2.patch

Avery, thank you for the review. I think the GraphGenerator is necessary to test the IO-related subsystems as a whole. For example, *InputFormat classes and Partitioners can be exercised with a generated data set instead of PseudoRandomVertexInputFormat. As you suggested, I modified PageRank/RandomMessageBenchmark to use a specified InputFormat and an input path. If the input format and input path are not given, they behave as in the current implementation and use PseudoRandomVertexInputFormat.

Implement a Graph Generator
Key: GIRAPH-68
URL: https://issues.apache.org/jira/browse/GIRAPH-68
Project: Giraph
Issue Type: New Feature
Components: benchmark
Affects Versions: 0.70.0
Reporter: Hyunsik Choi
Assignee: Hyunsik Choi
Attachments: GIRAPH-68_1.patch, GIRAPH-68_2.patch

To provide users with benchmark environments and to test the input/output system of Giraph in depth, we need a graph generator. We will enable the graph generator to generate various kinds of graph data sets by specifying a VertexInputFormat and a VertexOutputFormat.
[jira] [Commented] (GIRAPH-68) Implement a Graph Generator
[ https://issues.apache.org/jira/browse/GIRAPH-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152185#comment-13152185 ]

Hyunsik Choi commented on GIRAPH-68:

I missed the javadoc. I will reattach the patch with javadoc included.

Implement a Graph Generator
Key: GIRAPH-68
URL: https://issues.apache.org/jira/browse/GIRAPH-68
Project: Giraph
Issue Type: New Feature
Components: benchmark
Affects Versions: 0.70.0
Reporter: Hyunsik Choi
Assignee: Hyunsik Choi
Attachments: GIRAPH-68_1.patch, GIRAPH-68_2.patch
[jira] [Commented] (GIRAPH-77) Coordinator should expose a web interface with progress, vertex region assignments, etc.
[ https://issues.apache.org/jira/browse/GIRAPH-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152583#comment-13152583 ]

Hyunsik Choi commented on GIRAPH-77:

I also think this feature is necessary, because we will no longer depend on MapReduce after we port Giraph to YARN.

Coordinator should expose a web interface with progress, vertex region assignments, etc.
Key: GIRAPH-77
URL: https://issues.apache.org/jira/browse/GIRAPH-77
Project: Giraph
Issue Type: New Feature
Reporter: Jakob Homan

It would be nice if the coordinator worker had a web interface that showed progress, splits, etc. during job execution. Right now it would duplicate information currently being exposed through task status, but with the move to YARN, it will be a necessity. It would be great if we could do this in a modern way to avoid the screen-scraping, etc. currently used to get information from most other Hadoop projects' web interfaces. The coordinator could announce its address at the beginning or via status updates.
[jira] [Commented] (GIRAPH-45) Improve the way to keep outgoing messages
[ https://issues.apache.org/jira/browse/GIRAPH-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151038#comment-13151038 ]

Hyunsik Choi commented on GIRAPH-45:

I'm in another time zone. I'm sad to have missed the hot party.

I see this problem as a choice between "Giraph becomes slow but still works" and "Giraph cannot handle some problems or data at all" once the volume of generated messages exceeds the memory capacity. As you mentioned, spilling data to disk is apparently the simplest way to solve this problem. In addition, it does not affect the usual cases if spilling starts only when memory is getting tight.

Anyway, is the conclusion of the discussion as follows?

- Each worker sends outgoing messages in an eager manner (immediately or periodically).
- The receiving side spills incoming messages to disk only when memory is getting tight.

Avery, I also agree that storing partitions on disk is a good way to mitigate the memory problem. I think both ways are compatible and have different effects. Storing partitions is more efficient if the volume of graph data is very large. Later, if Giraph lets users choose among the options (i.e., spilling messages, storing partitions on disk, or both), users can pick whichever suits their programs.

Improve the way to keep outgoing messages
Key: GIRAPH-45
URL: https://issues.apache.org/jira/browse/GIRAPH-45
Project: Giraph
Issue Type: Improvement
Components: bsp
Reporter: Hyunsik Choi
Assignee: Hyunsik Choi
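As a rough illustration of the "spill only when memory is tight" idea from the GIRAPH-45 comment above, here is a minimal sketch of a receive-side buffer. It is an assumption-laden toy: the threshold is a plain message count (a real implementation would more likely watch heap usage), messages are Strings, and SpillingBuffer, add, drain and roundTrip are hypothetical names that do not exist in Giraph.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical receive-side buffer; the threshold is a message count here.
final class SpillingBuffer {
  private final int maxInMemory;
  private final List<String> inMemory = new ArrayList<String>();
  private File spillFile;             // created lazily on the first spill
  private DataOutputStream spillOut;
  private int spilled;

  SpillingBuffer(int maxInMemory) {
    this.maxInMemory = maxInMemory;
  }

  // Keeps messages in memory until the threshold is hit, then appends to disk.
  void add(String msg) throws IOException {
    if (inMemory.size() < maxInMemory) {
      inMemory.add(msg);
      return;
    }
    if (spillOut == null) {
      spillFile = File.createTempFile("spill", ".bin");
      spillFile.deleteOnExit();
      spillOut = new DataOutputStream(new FileOutputStream(spillFile));
    }
    spillOut.writeUTF(msg);
    spilled++;
  }

  int spilledCount() {
    return spilled;
  }

  // Returns the in-memory messages followed by the spilled ones, then resets.
  List<String> drain() throws IOException {
    List<String> all = new ArrayList<String>(inMemory);
    inMemory.clear();
    if (spillOut != null) {
      spillOut.close();
      DataInputStream in = new DataInputStream(new FileInputStream(spillFile));
      try {
        for (int i = 0; i < spilled; i++) {
          all.add(in.readUTF());
        }
      } finally {
        in.close();
      }
      spillFile.delete();
      spillOut = null;
      spillFile = null;
      spilled = 0;
    }
    return all;
  }

  // Demo helper: pushes n messages through a buffer and drains it.
  static List<String> roundTrip(int maxInMemory, int n) {
    try {
      SpillingBuffer buffer = new SpillingBuffer(maxInMemory);
      for (int i = 0; i < n; i++) {
        buffer.add("msg-" + i);
      }
      return buffer.drain();
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

With a generous threshold nothing touches the disk, which matches the point that the usual cases are unaffected.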
[jira] [Created] (GIRAPH-79) Change the menu layout of the site
Change the menu layout of the site
Key: GIRAPH-79
URL: https://issues.apache.org/jira/browse/GIRAPH-79
Project: Giraph
Issue Type: Task
Components: site
Reporter: Hyunsik Choi

The current site has the basic menu layout generated by the maven site plugin. This layout makes it hard to accommodate new content. I would like to suggest the following menu layout.

http://people.apache.org/~hyunsik/giraph/site/index.html

Although the layout includes most of the existing content, it has two additional categories, Giraph and Documentation. I think this layout is simpler and makes it easy to add new content. Does anyone have other suggestions?
[jira] [Updated] (GIRAPH-79) Change the menu layout of the site
[ https://issues.apache.org/jira/browse/GIRAPH-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyunsik Choi updated GIRAPH-79:
Attachment: GIRAPH-79_1.patch

Change the menu layout of the site
Key: GIRAPH-79
URL: https://issues.apache.org/jira/browse/GIRAPH-79
Project: Giraph
Issue Type: Task
Components: site
Reporter: Hyunsik Choi
Labels: site
Attachments: GIRAPH-79_1.patch
[jira] [Commented] (GIRAPH-79) Change the menu layout of the site
[ https://issues.apache.org/jira/browse/GIRAPH-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149304#comment-13149304 ]

Hyunsik Choi commented on GIRAPH-79:

Gianmarco, I misunderstood your comment. You did not mean removing the project reports. I agree that the reports should be placed deeper in the site.

Change the menu layout of the site
Key: GIRAPH-79
URL: https://issues.apache.org/jira/browse/GIRAPH-79
Project: Giraph
Issue Type: Task
Components: site
Reporter: Hyunsik Choi
Labels: site
Attachments: GIRAPH-79_1.patch
[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph
[ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149412#comment-13149412 ]

Hyunsik Choi commented on GIRAPH-11:

Avery, I'm sorry for the delayed review. I'm digging into your patch now. It looks great! Based on this work, we can consider more advanced graph partitioners based on the number of edge cuts between graph partitions. I need about one more day to investigate, because the patch is somewhat complicated for me :) Besides, for a deeper review, I would like to run some tests and trace them. Your patch needs a rebase. Could you rebase it? Thank you :)

Improve the graph distribution of Giraph
Key: GIRAPH-11
URL: https://issues.apache.org/jira/browse/GIRAPH-11
Project: Giraph
Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
Attachments: GIRAPH-11.diff

Currently, Giraph assumes that the data from the VertexInputFormat is sorted. If the user data is not sorted by the vertex id, they must first run a MapReduce or Pig job to generate a sorted dataset. This is often a bit inconvenient. Giraph graph partitioning is currently range based and there are some advantages and disadvantages of this approach. The proposal of this JIRA would be to allow for both range and hash based partitioning and provide more flexibility to the user.

Design goals for the graph distribution:
* Allow vertices to be ordered or unordered
* Ability to repartition
* Select the partitioning scheme based on user needs (i.e. hash or range based)
* Ability to provide user-specific hints about partitions

Hash-based partitioning
* Good vertex balancing across ranges for random data
* Bad at vertex id locality

Range-based partitioning
* Good at vertex id locality
* Ability to split ranges easily
* Can cause hotspots for hot ranges
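The hash versus range trade-off listed in the GIRAPH-11 description can be seen in a few lines. This is only an illustrative sketch, not the code from GIRAPH-11.diff: PartitionSketch, hashPartition and rangePartition are hypothetical names, vertex ids are assumed to be non-negative longs, and the range variant assumes the largest id is known up front.

```java
// Illustrative partitioners; class and method names are hypothetical.
final class PartitionSketch {
  // Hash-based: spreads ids across partitions but scatters neighboring ids.
  static int hashPartition(long vertexId, int partitions) {
    int hash = (int) (vertexId ^ (vertexId >>> 32));
    return (hash & Integer.MAX_VALUE) % partitions;
  }

  // Range-based: contiguous ids share a partition, which keeps id locality
  // but can create hotspots. Assumes ids lie in [0, maxId].
  static int rangePartition(long vertexId, long maxId, int partitions) {
    long rangeSize = (maxId / partitions) + 1;
    return (int) (vertexId / rangeSize);
  }
}
```

For example, with 4 partitions and ids up to 99, ids 10 and 11 land in different partitions under hashing (2 and 3) but in the same partition under ranges (both in 0).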
Re: better way to update site?
+1 This way is the best for us :)

-- Hyunsik Choi

On Mon, Nov 14, 2011 at 12:48 PM, Jakob Homan jgho...@gmail.com wrote:

Cool. I've gone ahead and deleted the generated site from the repo and copied in the latest version (post GIRAPH-75). Thanks.
Re: better way to update site?
Thank you for the nice instructions. I've updated the rsync command for group permission.

-- Hyunsik Choi

On Mon, Nov 14, 2011 at 1:19 PM, Jakob Homan jgho...@gmail.com wrote:

I've added a page to the wiki with instructions on how I did it: https://cwiki.apache.org/confluence/display/GIRAPH/Committer+notes

On Sun, Nov 13, 2011 at 8:18 PM, Hyunsik Choi hyun...@apache.org wrote:

+1 This way is the best for us :)

-- Hyunsik Choi

On Mon, Nov 14, 2011 at 12:48 PM, Jakob Homan jgho...@gmail.com wrote:

Cool. I've gone ahead and deleted the generated site from the repo and copied in the latest version (post GIRAPH-75). Thanks.
[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph
[ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149471#comment-13149471 ]

Hyunsik Choi commented on GIRAPH-11:

Thank you for the rebase.

Improve the graph distribution of Giraph
Key: GIRAPH-11
URL: https://issues.apache.org/jira/browse/GIRAPH-11
Project: Giraph
Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
Attachments: GIRAPH-11.diff
[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph
[ https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148187#comment-13148187 ]

Hyunsik Choi commented on GIRAPH-11:

That's a huge patch :) I have just started to explore your patch. I will leave some comments (maybe tomorrow).

Improve the graph distribution of Giraph
Key: GIRAPH-11
URL: https://issues.apache.org/jira/browse/GIRAPH-11
Project: Giraph
Issue Type: Improvement
Affects Versions: 0.70.0
Reporter: Avery Ching
Assignee: Avery Ching
Attachments: GIRAPH-11.diff
[jira] [Commented] (GIRAPH-64) Create VertexRunner to make it easier to run users' computations
[ https://issues.apache.org/jira/browse/GIRAPH-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146023#comment-13146023 ]

Hyunsik Choi commented on GIRAPH-64:

In my case, 'mvn package' is ok, but 'mvn assembly:assembly' fails with the error I mentioned above.

{code}
hyunsik@code:~$ mvn --version
Apache Maven 3.0.3 (r1075438; 2011-03-01 02:31:09+0900)
Maven home: /home/hyunsik/Local/maven-3
Java version: 1.6.0_26, vendor: Sun Microsystems Inc.
Java home: /usr/lib/jvm/java-6-sun-1.6.0.26/jre
Default locale: en_US, platform encoding: UTF-8
OS name: linux, version: 3.0.0-12-generic, arch: amd64, family: unix
{code}

Create VertexRunner to make it easier to run users' computations
Key: GIRAPH-64
URL: https://issues.apache.org/jira/browse/GIRAPH-64
Project: Giraph
Issue Type: New Feature
Reporter: Jakob Homan
Assignee: Jakob Homan
Attachments: GIRAPH-64.patch

Currently, if a user wants to implement a Giraph algorithm by extending {{Vertex}} they must also write all the boilerplate around the {{Tool}} interface and bundle it with the Giraph jar (or get Giraph on the classpath and playing nice with the implementation). For example, what is included in the PageRankBenchmark and what Kohei has done: https://github.com/smly/java-Giraph-LabelPropagation

It would be better if we had perhaps a Vertex implementation to be subclassed that already had all the standard Tooling included such that all one had to run would be (assuming the Giraph jar was already on the classpath):

{noformat}hadoop jar my-awesome-vertex.jar my.awesome.vertex -i jazz_input -o jazz_output -if org.apache.giraph.lib.in.text.adjacency-list.LongDoubleDouble -of org.apache.giraph.lib.out.text.adjacency-list.LongDoubleDouble{noformat}

This wouldn't work with every algorithm, but would be useful in a large number of cases.
[jira] [Commented] (GIRAPH-36) Ensure that subclassing BasicVertex is possible by user apps
[ https://issues.apache.org/jira/browse/GIRAPH-36?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139976#comment-13139976 ]

Hyunsik Choi commented on GIRAPH-36:

Looks great! Actually, I need more time to fully catch up with this patch. As a first step, I ran the unit tests on a real Hadoop cluster running on localhost. All tests passed!

Ensure that subclassing BasicVertex is possible by user apps
Key: GIRAPH-36
URL: https://issues.apache.org/jira/browse/GIRAPH-36
Project: Giraph
Issue Type: Improvement
Components: graph
Affects Versions: 0.70.0
Reporter: Jake Mannix
Assignee: Jake Mannix
Priority: Blocker
Fix For: 0.70.0
Attachments: GIRAPH-36.diff

Original assumptions in Giraph were that all users would subclass Vertex (which extended MutableVertex, which extended BasicVertex). Classes which wish to have application specific data structures (i.e. not a TreeMap<I, Edge<I, E>>) may need to extend either MutableVertex or BasicVertex. Unfortunately VertexRange extends ArrayList<Vertex>, and there are other places where the assumption is that vertex classes are either Vertex, or at least MutableVertex. Let's make sure the internal APIs allow for BasicVertex to be the base class.
[jira] [Commented] (GIRAPH-13) Port Giraph to YARN
[ https://issues.apache.org/jira/browse/GIRAPH-13?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136877#comment-13136877 ]

Hyunsik Choi commented on GIRAPH-13:

There are probably many prerequisites and difficult issues. I'm willing to wait for your update :)

Port Giraph to YARN
Key: GIRAPH-13
URL: https://issues.apache.org/jira/browse/GIRAPH-13
Project: Giraph
Issue Type: New Feature
Reporter: Jakob Homan
Assignee: Jakob Homan

Now that YARN (aka MR2 aka MAPREDUCE-279) has been merged into the Hadoop trunk, we should think about what it would take to separate out the graph processing bits of Giraph from the MR1-specific code so as to take advantage of the less-MR centric aspects of YARN, while still supporting both over the medium term.
Re: Relax RTC on web site commits?
+1 On Thu, Oct 27, 2011 at 9:13 AM, Jakob Homan jgho...@gmail.com wrote: Currently we're doing individual JIRAs for each change to the website, which is a bunch of ceremony for a routine matter. In GIRAPH-66, we discussed relaxing this requirement for website changes. This is an approach we've used in other projects and it has worked. I'm +1. Thoughts?
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120604#comment-13120604 ] Hyunsik Choi commented on GIRAPH-12: Thank you for review. I agree with your opinion. The virtual memory size seems very important in 32-bit JVMs. I only considered 64-bit JVMs. I overlooked that point. Anyway, this patch allows users to control the number of threads. It is more helpful in restricted environment (e.g., 32-bit JVM). Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch, GIRAPH-12_2.patch, GIRAPH-12_3.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyunsik Choi updated GIRAPH-12: --- Attachment: GIRAPH-12_3.patch
Avery, thank you for your comments. Yes, I agree that the aggregate wasted memory should be considered =)
In my latest comment, I investigated how much stack threads really occupy by using the Sleep class (https://gist.github.com/1249761). When I created 2000 Sleep threads with the stack size option '-Xss4096k', the resident memory of the whole process, including all threads, was only 46 megabytes. So an individual thread consumes far less stack than the configured thread stack size. The configured thread stack size affects the size of the virtual memory area, not the resident memory size. The actual stack usage per thread seems to depend only on local variables and the depth of function invocations. As a result, I guess that the memory problem is mostly caused by outgoing messages kept in memory =)
Anyway, I attach the patch. The main difference from the previous patch is that the thread pool size defaults to the number of workers - 1 if unset. I also added more comments. The unit tests pass against a real Hadoop cluster.
Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch, GIRAPH-12_2.patch, GIRAPH-12_3.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. 
By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier.
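Editor's note: the default described in the update above (pool size falls back to the number of workers - 1 when unset) can be illustrated with a small sketch. This is not the code in GIRAPH-12_3.patch; the class and method names (`FlusherPoolSketch`, `resolvePoolSize`, `flushToPeers`) are hypothetical, and the "flush" is only a counter increment standing in for sending one peer's buffered messages.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration; names are assumptions, not taken from the patch. */
class FlusherPoolSketch {
  /** Returns the configured pool size, or (numWorkers - 1) when it is unset (<= 0). */
  static int resolvePoolSize(int configuredSize, int numWorkers) {
    if (configuredSize > 0) {
      return configuredSize;
    }
    // One flusher per remote peer, but never fewer than one thread.
    return Math.max(1, numWorkers - 1);
  }

  /** Runs one flush task per peer on a bounded pool and returns how many completed. */
  static int flushToPeers(int numPeers, int poolSize) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    AtomicInteger flushed = new AtomicInteger();
    for (int peer = 0; peer < numPeers; peer++) {
      // Stand-in for transmitting the buffered messages of one peer.
      pool.execute(() -> { flushed.incrementAndGet(); });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(30, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return flushed.get();
  }
}
```

With a fixed pool, 400 peers no longer imply 400 threads per worker: the tasks queue up and a small, configurable number of threads drains them.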
Re: Unit tests on real hadoop cluster
Avery and Jake, thank you for your comments =) I ran the unit tests on a Hadoop cluster with Hadoop CDH 0.20.2+923.97 installed. On this cluster, many of the unit tests failed because MR jobs could not be submitted to the cluster. I have not investigated this problem yet. Is this my mistake? If this is an unknown problem, I'll create a JIRA issue for it. Anyway, I installed Hadoop 0.20.203 on my local machine, and all unit tests work fine there =) -- Hyunsik Choi Database Lab, Korea University
On Fri, Sep 30, 2011 at 2:16 PM, Jakob Homan jgho...@gmail.com wrote: I've run them on our 20x cluster with no problems, but with a local ZK, not a specified instance.
On Thu, Sep 29, 2011 at 10:05 PM, Avery Ching ach...@apache.org wrote: Actually, to be fair, I've only executed the distributed unittests on my own local Hadoop instance. I just ran the Hadoop unittests against trunk on my local machine to check
mvn test -Dprop.mapred.job.tracker=localhost:50300
snip
Results : Tests run: 27, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 12:19.143s
[INFO] Finished at: Thu Sep 29 21:55:55 PDT 2011
[INFO] Final Memory: 6M/81M
[INFO]
Everything should be fine. Avery
On 9/29/11 5:18 PM, Hyunsik Choi wrote: I would like to run the unit tests on a real Hadoop cluster. I tried to execute the following command against the Giraph trunk version.
mvn test -Dprop.mapred.job.tracker=xxx.korea.ac.kr:8021 -Dprop.zookeeper.list=xxx.korea.ac.kr:2181
However, the unit tests failed as follows: https://gist.github.com/1252309 I think it may be my fault because the source code is the trunk version. Any suggestions would be helpful. -- Hyunsik Choi Database Lab, Korea University
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116960#comment-13116960 ] Hyunsik Choi commented on GIRAPH-12: Dmitriy, thank you for your comments. Regardless of the problem caused by thread stack size, those approaches look promising. In particular, spilling messages to disk looks necessary for Giraph to deal with really large graph data. Otherwise, an out-of-memory error may occur when the message generation rate is higher than the network bandwidth. I'll open a separate issue about this. Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch, GIRAPH-12_2.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier.
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116961#comment-13116961 ] Hyunsik Choi commented on GIRAPH-12:
Avery, thank you for your review. You are right. Runtime's totalMemory() and freeMemory() methods don't measure stack sizes. I'm sure of it after testing the code below. https://gist.github.com/1249761
I have looked for a way to measure the stack size of a Java application, but I could not find one. So I'm still not sure how to show that the thread pool approach reduces thread stack memory. For now, your way seems to be the only method to prove it.
However, I'm curious how much memory overhead threads really have, so before trying your approach I conducted some simple experiments. I used the above source code to investigate the memory usage of threads. It was executed on a machine with an Intel i3, Ubuntu 11.10 (64-bit), and 8 GB of memory. I measured memory by using 'top'. 'top' shows several columns, including VIRT, RES, and SHR. We only need to focus on RES, the resident memory. RES includes all resident memory usage, such as heap and stack; I learned this from this page (http://goo.gl/JE7fD).
First, I executed the above code with 1000 threads and without the JVM option '-Xss'. According to this page (http://goo.gl/sz2qM), the default stack size ('-Xss') is 1024k on the JVM for 64-bit Linux. After all threads were created, I ran 'top' to print the memory usage as follows:
1k threads with default thread stack size.
{noformat}
VIRT RES SHR
9163 hyunsik 20 0 3366m 30m 8296 S 18 0.4 0:01.52 java
{noformat}
2k threads with default thread stack size.
{noformat}
VIRT RES SHR
11223 hyunsik 20 0 4434m 46m 8340 S 40 0.6 0:04.11 java
{noformat}
With 1k and 2k threads, the program consumes only 30 and 46 megabytes of resident memory, respectively. The memory usage of the threads is smaller than I expected. I wonder whether thread stack size is really the main cause of the memory problem that we have faced. Besides, even though the default stack size is 1024k, the thread stack size does not seem to affect RES.
I ran more tests with '-Xss' in order to investigate the thread stack size further.
1k threads with '-Xss4096k'.
{noformat}
28301 hyunsik 20 0 6380m 30m 8292 S 17 0.4 0:05.25 java
{noformat}
2k threads with '-Xss4096k'
{noformat}
29326 hyunsik 20 0 10.1g 46m 8300 S 38 0.6 0:03.42 java
{noformat}
VIRT is clearly affected by '-Xss', but RES is not. '-Xss' seems to be the maximum stack size of each thread, because it doesn't affect RES. What do you think about that?
Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch, GIRAPH-12_2.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier.
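Editor's note: the contents of the linked gist are not reproduced in this archive, so the following is only a guess at the shape of such an experiment, with hypothetical names (`SleepThreadsSketch`, `startParked`). It starts N idle threads, prints the heap figure that `Runtime` reports, and then waits so that VIRT and RES can be read from `top` in another terminal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for the Sleep experiment; not the code from the gist. */
class SleepThreadsSketch {
  /** Starts n daemon threads that block on the latch, so their stacks stay shallow. */
  static List<Thread> startParked(int n, CountDownLatch release) {
    List<Thread> threads = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      Thread t = new Thread(() -> {
        try {
          release.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      t.setDaemon(true);
      t.start();
      threads.add(t);
    }
    return threads;
  }

  /** Heap in use as seen by Runtime; thread stacks are not part of this number. */
  static long usedHeapBytes() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  public static void main(String[] args) throws Exception {
    int n = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
    CountDownLatch release = new CountDownLatch(1);
    List<Thread> threads = startParked(n, release);
    System.out.println("started " + threads.size() + " threads, heap in use: "
        + usedHeapBytes() + " bytes");
    System.out.println("check VIRT and RES for this process in top, then press Enter");
    System.in.read();
    release.countDown();
  }
}
```

Running it once as `java SleepThreadsSketch 2000` and once as `java -Xss4096k SleepThreadsSketch 2000` is one way to repeat the comparison described above; the figures will differ by JVM and operating system.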
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116166#comment-13116166 ] Hyunsik Choi commented on GIRAPH-12:
Avery, thank you for your comments. I decided to use Runtime. It seems to be enough to investigate this issue. Again, I ran a benchmark to measure memory consumption with RandomMessageBenchmark as follows:
{noformat}
hadoop jar giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.RandomMessageBenchmark -e 2 -s 3 -w 20 -b 4 -n 150 -V ${V} -v -f ${f}
{noformat}
where the 'f' option indicates the number of threads in the thread pool. I also changed the thread executor to a FixedThreadPool. I ran every experiment twice and took the average. You can see the results at the link below: http://goo.gl/arP62
These experiments were conducted on two cluster nodes, each of which has 24 cores and 64 GB of memory. They are connected to each other over 1 Gbps Ethernet. I measured the memory footprint from Runtime in GraphMapper, as Avery recommended.
In sum, the thread pool approach is better than the original approach in terms of processing time. I guess that this is because the thread pool approach reduces the context switching cost and narrows the synchronized region. Unfortunately, however, the thread pool approach doesn't reduce memory consumption, which is the main focus of this issue. Rather, it needs slightly more memory, as shown in Figures 3 and 4. However, note the experiments with f = 5 and f = 20: in these experiments, the number of threads has only a small effect on memory consumption.
We have faced the memory problem, so we may need to approach it from another angle. I think that this problem may be mainly caused by the current message flushing strategy. In the current implementation, outgoing messages are transmitted to other peers in only two cases:
1) When the number of outgoing messages for a specific peer exceeds a threshold (i.e., maxSize), the outgoing messages for that peer are transmitted to it.
2) When a superstep is finished, all remaining messages are flushed to the other peers.
The flush of case 2 is only triggered at the end of a superstep. During processing, message flushing depends only on case 1. This may not be effective, because case 1 only considers the number of messages for each specific peer. It never takes the real memory occupation into account. If the destinations of outgoing messages are uniformly distributed, an out-of-memory error may occur before any 'case 1' flush is triggered. To overcome this problem, we may need a more eager message flushing strategy or some approach that stores overflow messages on disk. Let me know what you think.
Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch, GIRAPH-12_2.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier.
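Editor's note: the "more eager flushing" idea in the comment above can be sketched as a buffer that keeps the existing per-peer count trigger and adds a global byte budget. This is an illustration under assumptions, not Giraph's implementation: `OutgoingBufferSketch`, `maxTotalBytes`, and the choice to evict the peer holding the most bytes are all hypothetical, and `flush` only counts batches instead of sending them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical outgoing-message buffer with a count trigger and a byte budget. */
class OutgoingBufferSketch {
  private final int maxMessagesPerPeer; // per-peer count threshold (case 1 above)
  private final long maxTotalBytes;     // assumed global budget, not in the original code
  private final Map<Integer, List<byte[]>> pending = new HashMap<>();
  private long totalBytes = 0;
  private int flushCount = 0;

  OutgoingBufferSketch(int maxMessagesPerPeer, long maxTotalBytes) {
    this.maxMessagesPerPeer = maxMessagesPerPeer;
    this.maxTotalBytes = maxTotalBytes;
  }

  void add(int peer, byte[] message) {
    List<byte[]> queue = pending.computeIfAbsent(peer, k -> new ArrayList<>());
    queue.add(message);
    totalBytes += message.length;
    if (queue.size() > maxMessagesPerPeer) {
      flush(peer);                 // existing trigger: too many messages for one peer
    } else if (totalBytes > maxTotalBytes) {
      flush(peerWithMostBytes());  // added trigger: total buffered bytes over budget
    }
  }

  private int peerWithMostBytes() {
    int best = -1;
    long bestBytes = -1;
    for (Map.Entry<Integer, List<byte[]>> e : pending.entrySet()) {
      long bytes = 0;
      for (byte[] m : e.getValue()) {
        bytes += m.length;
      }
      if (bytes > bestBytes) {
        bestBytes = bytes;
        best = e.getKey();
      }
    }
    return best;
  }

  void flush(int peer) {
    List<byte[]> queue = pending.remove(peer);
    if (queue == null) {
      return;
    }
    for (byte[] m : queue) {
      totalBytes -= m.length;
    }
    flushCount++; // a real implementation would transmit the batch here
  }

  /** End-of-superstep flush (case 2 above). */
  void flushAll() {
    for (Integer peer : new ArrayList<>(pending.keySet())) {
      flush(peer);
    }
  }

  long bufferedBytes() { return totalBytes; }
  int flushCount() { return flushCount; }
}
```

With uniformly spread destinations, the count trigger alone never fires and the buffer keeps growing, while the byte budget bounds it. Spilling the evicted batch to disk instead of sending it would be a variation on the same trigger.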
[jira] [Created] (GIRAPH-42) The MapReduce counter 'Sent Messages' doesn't work.
The MapReduce counter 'Sent Messages' doesn't work. --- Key: GIRAPH-42 URL: https://issues.apache.org/jira/browse/GIRAPH-42 Project: Giraph Issue Type: Bug Components: bsp Reporter: Hyunsik Choi Priority: Minor The MapReduce counter 'Sent Messages' doesn't work. It always shows 0. {noformat} . . 11/09/28 10:51:22 INFO mapred.JobClient: Current workers=20 11/09/28 10:51:22 INFO mapred.JobClient: Current master task partition=0 11/09/28 10:51:22 INFO mapred.JobClient: Sent messages=0 11/09/28 10:51:22 INFO mapred.JobClient: Aggregate finished vertices=60 11/09/28 10:51:22 INFO mapred.JobClient: Aggregate vertices=60 . . {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114646#comment-13114646 ] Hyunsik Choi commented on GIRAPH-12: I'm sorry too for the late response. I was out of town for personal reasons and just got home. The previous experiments were too simple; actually, they cannot show any meaningful result. I'm sorry for that. As to question 3, this issue originated from memory usage, so I should have measured the memory usage. I'll answer your 3 questions soon :) Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch, GIRAPH-12_2.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier.
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114709#comment-13114709 ] Hyunsik Choi commented on GIRAPH-12: I have thought about question 3, that is, how we can measure memory usage while Giraph is running. Probably the most basic way is to use Hadoop metrics (http://www.cloudera.com/blog/2009/03/hadoop-metrics/). However, this requires changing the _hadoop-metrics.properties_ file, so it may not be allowed on most large clusters, e.g., the Yahoo! cluster that Avery can access. If that way is not possible, we can implement a thread class that mimics Hadoop metrics: it periodically measures the memory usage of the JVM and sends it to a specific remote server. What do you think about that? Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch, GIRAPH-12_2.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier.
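Editor's note: a minimal sketch of the periodic sampler proposed in the comment above, assuming a hypothetical class `MemorySamplerSketch`. The `LongConsumer` sink is a placeholder for "send to a remote server"; no Hadoop metrics API is used, and only heap usage as reported by `Runtime` is sampled.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Hypothetical periodic JVM memory sampler; the sink stands in for a remote reporter. */
class MemorySamplerSketch implements AutoCloseable {
  private final ScheduledExecutorService scheduler;

  MemorySamplerSketch(long periodMillis, LongConsumer sink) {
    // A single daemon thread, so the sampler never keeps the JVM alive on its own.
    scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "memory-sampler");
      t.setDaemon(true);
      return t;
    });
    scheduler.scheduleAtFixedRate(
        () -> sink.accept(usedHeapBytes()), 0, periodMillis, TimeUnit.MILLISECONDS);
  }

  /** Heap in use according to Runtime; thread stacks are not included. */
  static long usedHeapBytes() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  @Override
  public void close() {
    scheduler.shutdownNow();
  }
}
```

A caller might create it at task setup, for example `new MemorySamplerSketch(1000, bytes -> System.err.println("heap=" + bytes))`, and close it at shutdown.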
[jira] [Updated] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyunsik Choi updated GIRAPH-12: --- Attachment: GIRAPH-12_2.patch
I attach the second patch. I have benchmarked this patch via GIRAPH-32. The results are shown below. In the results, the improved version is slightly better than the current implementation. As Avery mentioned, the improved one makes the threads controllable, so it is an improvement. Users can adjust the number of core threads and max threads by using GiraphJob's constants, such as MSG_FLUSHER_CORE_SIZE and MSG_FLUSHER_MAX_SIZE. This setting can affect performance, so we may need to guide users in finding the best parameters. However, this experiment may not be enough to evaluate the approach, because it was conducted on a small cluster.
*the result of original version*
{noformat}
org.apache.giraph.benchmark.RandomMessageBenchmark -e 2 -s 3 -w 6 -b 4 -n 150 -V 30 -v
= 1st =
11/09/22 00:55:06 INFO mapred.JobClient: Total (milliseconds)=63096
11/09/22 00:55:06 INFO mapred.JobClient: Superstep 3 (milliseconds)=551
11/09/22 00:55:06 INFO mapred.JobClient: Setup (milliseconds)=1331
11/09/22 00:55:06 INFO mapred.JobClient: Shutdown (milliseconds)=1008
11/09/22 00:55:06 INFO mapred.JobClient: Vertex input superstep (milliseconds)=516
11/09/22 00:55:06 INFO mapred.JobClient: Superstep 0 (milliseconds)=16079
11/09/22 00:55:06 INFO mapred.JobClient: Superstep 2 (milliseconds)=25657
11/09/22 00:55:06 INFO mapred.JobClient: Superstep 1 (milliseconds)=17950
= 2nd =
11/09/22 00:58:13 INFO mapred.JobClient: Total (milliseconds)=62771
11/09/22 00:58:13 INFO mapred.JobClient: Superstep 3 (milliseconds)=600
11/09/22 00:58:13 INFO mapred.JobClient: Setup (milliseconds)=1290
11/09/22 00:58:13 INFO mapred.JobClient: Shutdown (milliseconds)=950
11/09/22 00:58:13 INFO mapred.JobClient: Vertex input superstep (milliseconds)=614
11/09/22 00:58:13 INFO mapred.JobClient: Superstep 0 (milliseconds)=15654
11/09/22 00:58:13 INFO mapred.JobClient: Superstep 2 (milliseconds)=25157
11/09/22 00:58:13 INFO mapred.JobClient: Superstep 1 (milliseconds)=18499
{noformat}
*the result of patched version*
{noformat}
= 1st =
11/09/22 00:59:41 INFO mapred.JobClient: Total (milliseconds)=60068
11/09/22 00:59:41 INFO mapred.JobClient: Superstep 3 (milliseconds)=542
11/09/22 00:59:41 INFO mapred.JobClient: Setup (milliseconds)=1219
11/09/22 00:59:41 INFO mapred.JobClient: Shutdown (milliseconds)=1025
11/09/22 00:59:41 INFO mapred.JobClient: Vertex input superstep (milliseconds)=616
11/09/22 00:59:41 INFO mapred.JobClient: Superstep 0 (milliseconds)=15887
11/09/22 00:59:41 INFO mapred.JobClient: Superstep 2 (milliseconds)=23149
11/09/22 00:59:41 INFO mapred.JobClient: Superstep 1 (milliseconds)=17626
= 2nd =
11/09/22 01:01:05 INFO mapred.JobClient: Total (milliseconds)=60359
11/09/22 01:01:05 INFO mapred.JobClient: Superstep 3 (milliseconds)=510
11/09/22 01:01:05 INFO mapred.JobClient: Setup (milliseconds)=1399
11/09/22 01:01:05 INFO mapred.JobClient: Shutdown (milliseconds)=956
11/09/22 01:01:05 INFO mapred.JobClient: Vertex input superstep (milliseconds)=550
11/09/22 01:01:05 INFO mapred.JobClient: Superstep 0 (milliseconds)=16054
11/09/22 01:01:05 INFO mapred.JobClient: Superstep 2 (milliseconds)=23049
11/09/22 01:01:05 INFO mapred.JobClient: Superstep 1 (milliseconds)=17835
{noformat}
Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch, GIRAPH-12_2.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier.
[jira] [Resolved] (GIRAPH-32) Implement benchmarks to evaluate the performance of message passing
[ https://issues.apache.org/jira/browse/GIRAPH-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyunsik Choi resolved GIRAPH-32. Resolution: Fixed Because this issue got +1, I just committed. Implement benchmarks to evaluate the performance of message passing Key: GIRAPH-32 URL: https://issues.apache.org/jira/browse/GIRAPH-32 Project: Giraph Issue Type: Task Components: benchmark Reporter: Hyunsik Choi Assignee: Hyunsik Choi Fix For: 0.70.0 Attachments: GIRAPH-32.patch, GIRAPH-32_2.patch Message passing framework plays an important role in Giraph. We need some benchmark programs to evaluate the improvement related to message passing method. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (GIRAPH-32) Implement benchmarks to evaluate the performance of message passing
[ https://issues.apache.org/jira/browse/GIRAPH-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyunsik Choi updated GIRAPH-32: --- Attachment: GIRAPH-32_2.patch Good idea! According to which InputFormat we use, we could choose the distribution of destination vertices. I attach the patch that corrected coding convention. Implement benchmarks to evaluate the performance of message passing Key: GIRAPH-32 URL: https://issues.apache.org/jira/browse/GIRAPH-32 Project: Giraph Issue Type: Task Components: benchmark Reporter: Hyunsik Choi Assignee: Hyunsik Choi Fix For: 0.70.0 Attachments: GIRAPH-32.patch, GIRAPH-32_2.patch Message passing framework plays an important role in Giraph. We need some benchmark programs to evaluate the improvement related to message passing method. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (GIRAPH-32) Implement benchmarks to evaluate the performance of message passing
[ https://issues.apache.org/jira/browse/GIRAPH-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyunsik Choi updated GIRAPH-32: --- Attachment: GIRAPH-32.patch I attach the patch for this issue. This patch includes a benchmark class. In this benchmark, the compute function of each vertex sends a meaningless message along all edges of the vertex. My intent is for this benchmark to send messages to random workers. PseudoRandomVertexInputFormat already generates random edges, so I employed it. This benchmark allows users to set the message size in bytes and the number of messages sent per edge, because I think they are the basic factors for evaluating the behavior and performance of a message delivery system. Besides, users can adjust the number of edges per vertex rather than the number of messages sent per edge. This allows users to make the sending pattern either more spread out or more skewed. Can anyone review this? Implement benchmarks to evaluate the performance of message passing Key: GIRAPH-32 URL: https://issues.apache.org/jira/browse/GIRAPH-32 Project: Giraph Issue Type: Task Components: benchmark Reporter: Hyunsik Choi Assignee: Hyunsik Choi Fix For: 0.70.0 Attachments: GIRAPH-32.patch Message passing framework plays an important role in Giraph. We need some benchmark programs to evaluate the improvement related to message passing method.
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13107061#comment-13107061 ] Hyunsik Choi commented on GIRAPH-12: (a note for sharing) Graph mutation functions (e.g., addVertexRequest, addEdgeRequest..) directly invoke RPC functions. This approach incurs RPC round-trip overheads during processing. Especially when many workers try to mutate vertices or edges, synchronization overheads may also occur in receiving sides. It may be severe as the size of cluster increases. If we change graph mutation API to asynchronous messages, it would be more efficient. If possible, graph mutation messages and value messages (i.e., sendMsg) can be integrated into one message passing API. Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107063#comment-13107063 ] Hyunsik Choi commented on GIRAPH-12: (a note for sharing) In the current implementation, outgoing messages are sent to other peers on only two triggers: 1) When the number of outgoing messages for a specific peer exceeds a threshold (i.e., maxSize), the outgoing messages for that peer are transmitted to it. 2) When a superstep is finished, all remaining messages are flushed to the other peers. In case 1, however, the current implementation only considers the number of messages instead of the size of the messages. The outgoing messages reside in main memory until they are sent to other peers, which is another important factor in main memory consumption. It would be good to consider not only the number of messages but also the size of the messages. Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier.
[jira] [Commented] (GIRAPH-35) Modifying the site to indicated that Jake Mannix and Dmitriy Ryaboy are now Giraph committers
[ https://issues.apache.org/jira/browse/GIRAPH-35?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13105795#comment-13105795 ] Hyunsik Choi commented on GIRAPH-35: +1 Welcome new committers :) Modifying the site to indicated that Jake Mannix and Dmitriy Ryaboy are now Giraph committers - Key: GIRAPH-35 URL: https://issues.apache.org/jira/browse/GIRAPH-35 Project: Giraph Issue Type: Task Reporter: Avery Ching Assignee: Avery Ching Attachments: GIRAPH-35.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104340#comment-13104340 ] Hyunsik Choi commented on GIRAPH-12: Do you mean that we need benchmark programs to test the performance and scalability of the message-passing methods? If so, I'll add two benchmark programs that send messages to peers following a random and a skewed distribution, respectively. I'll create a separate issue for this. Let me know what you think :) Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
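As a rough illustration of the two workloads proposed in the comment above, the sketch below picks destination peers either uniformly or with a skew. The comment does not say which skewed distribution is intended, so the Zipf-like weighting is an assumption, and DestinationPicker, exponent and seed are hypothetical names that do not come from Giraph.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical destination chooser for a message-passing benchmark; not part of Giraph. */
class DestinationPicker {
    private final double[] cumulative; // cumulative[i] = probability of choosing a peer <= i
    private final Random random;

    /** exponent 0 gives a uniform choice; a larger exponent concentrates traffic on low-numbered peers. */
    DestinationPicker(int numPeers, double exponent, long seed) {
        double[] weights = new double[numPeers];
        double total = 0;
        for (int i = 0; i < numPeers; i++) {
            weights[i] = 1.0 / Math.pow(i + 1, exponent); // Zipf-like weight (an assumption)
            total += weights[i];
        }
        cumulative = new double[numPeers];
        double running = 0;
        for (int i = 0; i < numPeers; i++) {
            running += weights[i] / total;
            cumulative[i] = running;
        }
        random = new Random(seed);
    }

    /** Returns the index of the peer that should receive the next message. */
    int next() {
        double u = random.nextDouble();
        for (int i = 0; i < cumulative.length; i++) {
            if (u < cumulative[i]) {
                return i;
            }
        }
        return cumulative.length - 1; // guards against floating-point rounding
    }

    /** Counts how many of the given number of messages each peer would receive. */
    static int[] histogram(int numPeers, double exponent, int samples, long seed) {
        DestinationPicker picker = new DestinationPicker(numPeers, exponent, seed);
        int[] counts = new int[numPeers];
        for (int i = 0; i < samples; i++) {
            counts[picker.next()]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("uniform: " + Arrays.toString(histogram(10, 0.0, 10000, 42L)));
        System.out.println("skewed:  " + Arrays.toString(histogram(10, 1.0, 10000, 42L)));
    }
}
```

In the skewed run, a few peers receive most of the traffic, which is the case that stresses per-peer buffering and flushing hardest.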
[jira] [Created] (GIRAPH-32) Implement benchmarks to evaluate the performance of message passing
Implement benchmarks to evaluate the performance of message passing Key: GIRAPH-32 URL: https://issues.apache.org/jira/browse/GIRAPH-32 Project: Giraph Issue Type: Task Components: benchmark Reporter: Hyunsik Choi Assignee: Hyunsik Choi Fix For: 0.70.0 The message-passing framework plays an important role in Giraph. We need benchmark programs to evaluate improvements to the message-passing method. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (GIRAPH-33) Missing license header of GraphState.java
Missing license header of GraphState.java - Key: GIRAPH-33 URL: https://issues.apache.org/jira/browse/GIRAPH-33 Project: Giraph Issue Type: Task Components: graph Reporter: Hyunsik Choi Priority: Trivial Fix For: 0.70.0 GraphState.java doesn't contain apache license header. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (GIRAPH-33) Missing license header of GraphState.java
[ https://issues.apache.org/jira/browse/GIRAPH-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyunsik Choi updated GIRAPH-33: --- Attachment: GIRAPH-33.patch This patch adds apache license header. Missing license header of GraphState.java - Key: GIRAPH-33 URL: https://issues.apache.org/jira/browse/GIRAPH-33 Project: Giraph Issue Type: Task Components: graph Reporter: Hyunsik Choi Priority: Trivial Fix For: 0.70.0 Attachments: GIRAPH-33.patch GraphState.java doesn't contain apache license header. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (GIRAPH-33) Missing license header of GraphState.java
[ https://issues.apache.org/jira/browse/GIRAPH-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyunsik Choi resolved GIRAPH-33. Resolution: Fixed This is a trivial fix. I just committed. Missing license header of GraphState.java - Key: GIRAPH-33 URL: https://issues.apache.org/jira/browse/GIRAPH-33 Project: Giraph Issue Type: Task Components: graph Reporter: Hyunsik Choi Priority: Trivial Fix For: 0.70.0 Attachments: GIRAPH-33.patch GraphState.java doesn't contain apache license header. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102682#comment-13102682 ] Hyunsik Choi commented on GIRAPH-12: As with the current PeerThread, each PeerConnection initially gets one established RPC proxy. These connections are kept open for the whole job, so there is no connection setup overhead during processing. If you could test this code on Yahoo!'s clusters, I would appreciate the help. Next week I will have access to my lab's Hadoop cluster, and I'll run some tests then as well. Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyunsik Choi updated GIRAPH-12: --- Attachment: GIRAPH-12_1.patch As Avery mentioned, in the current architecture each worker requires N threads to communicate with N remote peers. This may incur severe context-switching overhead (especially when all messages are flushed) and higher memory consumption. I first considered replacing the RPC system with another one, but that is not a simple task and I need more time for it. Instead, I have considered an alternative: employing a ThreadPoolExecutor to limit the number of active threads. When Giraph deals with large graphs, its performance is usually bound by network bandwidth, so I think this approach would be effective. In addition, I tried to reduce the synchronized region in which BasicRPCCommunicator (lines 374-394) sends large buffered messages to specific peers. I have attached the work-in-progress patch. I will not have access to a real Hadoop cluster for a week, so I have not tested this on a real cluster; however, all unit tests pass. What do you think of this approach? Could you review it? Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier. -- This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
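The thread-pool idea described in the update above can be sketched roughly as follows. This is not the attached patch: BoundedPeerFlusher and flushAll are hypothetical names, the short sleep merely stands in for the RPC call to a peer, and the pool size is an arbitrary example value. The point is only that a fixed-size ThreadPoolExecutor keeps the number of concurrently active sender threads at maxThreads, however many peers there are.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of bounding sender threads; not the GIRAPH-12 patch. */
class BoundedPeerFlusher {

    /**
     * Runs one flush task per peer on a pool of at most maxThreads threads
     * and returns the highest number of tasks observed running at once.
     */
    static int flushAll(int numPeers, int maxThreads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>()); // waiting tasks queue up instead of spawning threads
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(numPeers);

        for (int peer = 0; peer < numPeers; peer++) {
            pool.execute(() -> {
                int now = active.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(1); // placeholder for sending buffered messages to one peer
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    active.decrementAndGet();
                    done.countDown();
                }
            });
        }

        try {
            done.await(); // wait until every peer has been flushed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return peak.get();
    }

    public static void main(String[] args) {
        // 400 peers as in the issue description, but never more than 8 sender threads.
        int peak = flushAll(400, 8);
        System.out.println("peak concurrent senders: " + peak);
    }
}
```

The trade-off is that flushes to different peers are partly serialized, which is acceptable if, as the update argues, throughput is bound by network bandwidth rather than by the number of threads.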
[jira] [Updated] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyunsik Choi updated GIRAPH-12: --- Component/s: bsp Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Components: bsp Reporter: Avery Ching Assignee: Hyunsik Choi Priority: Minor Attachments: GIRAPH-12_1.patch Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-29) Implement TextVertexInputFormat for text-format graph data
[ https://issues.apache.org/jira/browse/GIRAPH-29?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102111#comment-13102111 ] Hyunsik Choi commented on GIRAPH-29: I'm sorry for my big mistake. I overlooked the org.apache.giraph.lib package. I have a question. When a program uses TextVertexInputFormat, is the number of active workers determined by the number of blocks? How does Giraph behave when there are more blocks than numWorkers? Should numWorkers be set by the user, taking into account both the size of the input data and the number of workers? Implement TextVertexInputFormat for text-format graph data -- Key: GIRAPH-29 URL: https://issues.apache.org/jira/browse/GIRAPH-29 Project: Giraph Issue Type: New Feature Components: bsp Reporter: Hyunsik Choi Assignee: Hyunsik Choi Priority: Minor Fix For: 0.70.0 Supporting text-format graph data would be nice. It is helpful for developing graph algorithms and debugging because text-format graph data are human-readable and enable users to easily write sample data sets. Furthermore, text-format data are exchangeable regardless of operating systems or programming languages. So, we need a basic InputFormat to help users develop user-defined InputFormat classes to deal text-represented graph data sets. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-29) Implement TextVertexInputFormat for text-format graph data
[ https://issues.apache.org/jira/browse/GIRAPH-29?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102177#comment-13102177 ] Hyunsik Choi commented on GIRAPH-29: Thank you for your kind reply. Implement TextVertexInputFormat for text-format graph data -- Key: GIRAPH-29 URL: https://issues.apache.org/jira/browse/GIRAPH-29 Project: Giraph Issue Type: New Feature Components: bsp Reporter: Hyunsik Choi Assignee: Hyunsik Choi Priority: Minor Fix For: 0.70.0 Supporting text-format graph data would be nice. It is helpful for developing graph algorithms and debugging because text-format graph data are human-readable and enable users to easily write sample data sets. Furthermore, text-format data are exchangeable regardless of operating systems or programming languages. So, we need a basic InputFormat to help users develop user-defined InputFormat classes to deal text-represented graph data sets. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-9) Change Yahoo License Header to Apache License Header
[ https://issues.apache.org/jira/browse/GIRAPH-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092619#comment-13092619 ] Hyunsik Choi commented on GIRAPH-9: --- Unfortunately, I didn't know that such a tool exists. I changed them by hand :) Change Yahoo License Header to Apache License Header Key: GIRAPH-9 URL: https://issues.apache.org/jira/browse/GIRAPH-9 Project: Giraph Issue Type: Task Reporter: Hyunsik Choi Assignee: Hyunsik Choi Fix For: 0.1.0 Attachments: GIRAPH-9.patch All source files contain the Yahoo license header, as follows: {noformat} Licensed to Yahoo! under one or more contributor license agreements. ... {noformat} These license headers should read as follows: {noformat} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. ... {noformat} -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-12) Investigate communication improvements
[ https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092622#comment-13092622 ] Hyunsik Choi commented on GIRAPH-12: Netty seems to be a good solution. Apache Avro now provides a Netty-based server. If we use Avro as the RPC mechanism among workers, we could solve this problem easily. Investigate communication improvements -- Key: GIRAPH-12 URL: https://issues.apache.org/jira/browse/GIRAPH-12 Project: Giraph Issue Type: Improvement Reporter: Avery Ching Priority: Minor Currently every worker will start up a thread to communicate with every other workers. Hadoop RPC is used for communication. For instance if there are 400 workers, each worker will create 400 threads. This ends up using a lot of memory, even with the option -Dmapred.child.java.opts=-Xss64k. It would be good to investigate using frameworks like Netty or custom roll our own to improve this situation. By moving away from Hadoop RPC, we would also make compatibility of different Hadoop versions easier. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Open JIRA for git clone
I think so. However, INFRA-3887 is a duplicate of https://issues.apache.org/jira/browse/INFRA-3843. -- Hyunsik Choi On Sat, Aug 27, 2011 at 5:28 AM, Jakob Homan jgho...@gmail.com wrote: FYI: I've opened a JIRA with Infra for a git clone of the svn repo: https://issues.apache.org/jira/browse/INFRA-3887 I really can't imagine going back to svn-only development; I'd probably go to culinary school or something.