[jira] [Updated] (GIRAPH-171) total time in MasterThread.run() is calculated incorrectly
[ https://issues.apache.org/jira/browse/GIRAPH-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz updated GIRAPH-171:
---------------------------------
    Attachment: GIRAPH-171.patch

> total time in MasterThread.run() is calculated incorrectly
> ----------------------------------------------------------
>
>                 Key: GIRAPH-171
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-171
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Eugene Koontz
>            Assignee: Eugene Koontz
>         Attachments: GIRAPH-171.patch
>
> While running PageRankBenchmark, I was seeing in the output:
> {{graph.MasterThread(172): total: Took 1.3336739262910001E9 seconds.}}
> This was because currently, in {{MasterThread.run()}}, we have:
> {code}
> LOG.info("total: Took " +
>     ((System.currentTimeMillis() / 1000.0d) -
>         setupSecs) + " seconds.");
> {code}
> but it should be:
> {code}
> LOG.info("total: Took " +
>     ((System.currentTimeMillis() - startMillis) /
>         1000.0d) + " seconds.");
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (GIRAPH-171) total time in MasterThread.run() is calculated incorrectly
total time in MasterThread.run() is calculated incorrectly
----------------------------------------------------------

                 Key: GIRAPH-171
                 URL: https://issues.apache.org/jira/browse/GIRAPH-171
             Project: Giraph
          Issue Type: Bug
            Reporter: Eugene Koontz
            Assignee: Eugene Koontz
         Attachments: GIRAPH-171.patch

While running PageRankBenchmark, I was seeing in the output:
{{graph.MasterThread(172): total: Took 1.3336739262910001E9 seconds.}}
This was because currently, in {{MasterThread.run()}}, we have:
{code}
LOG.info("total: Took " +
    ((System.currentTimeMillis() / 1000.0d) -
        setupSecs) + " seconds.");
{code}
but it should be:
{code}
LOG.info("total: Took " +
    ((System.currentTimeMillis() - startMillis) /
        1000.0d) + " seconds.");
{code}
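The arithmetic behind the bug can be seen in isolation: converting the current epoch timestamp to seconds and then subtracting a small setup duration still yields an epoch-scale number (~1.3E9 in 2012), whereas subtracting the start timestamp first gives the true elapsed seconds. A standalone sketch (not Giraph code; the two helper methods mirror the wrong and right expressions quoted above):

```java
// Reproduces the arithmetic behind GIRAPH-171 (a sketch, not Giraph code).
public class ElapsedTime {
    // Buggy formula: epoch time in seconds minus a (small) setup duration
    // is still epoch-scale, e.g. ~1.3E9 in April 2012.
    static double buggyElapsedSecs(long nowMillis, double setupSecs) {
        return (nowMillis / 1000.0d) - setupSecs;
    }

    // Fixed formula: subtract the start timestamp first, then convert to seconds.
    static double fixedElapsedSecs(long nowMillis, long startMillis) {
        return (nowMillis - startMillis) / 1000.0d;
    }

    public static void main(String[] args) {
        long startMillis = 1333673826291L;    // an April 2012 timestamp
        long nowMillis = startMillis + 5000L; // 5 seconds later
        System.out.println("buggy: " + buggyElapsedSecs(nowMillis, 0.1)); // ~1.33E9
        System.out.println("fixed: " + fixedElapsedSecs(nowMillis, startMillis)); // 5.0
    }
}
```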
[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph
[ https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247542#comment-13247542 ]

Paolo Castagna commented on GIRAPH-170:
---------------------------------------

bq. we may want to consider therefore adding backlinks

Yep. I'd like to better understand what people currently do if they need incoming and outgoing links for their processing.

An adjacency list can be constructed listing incoming (a.k.a. backlinks) as well as outgoing links, in one MapReduce job.

Input:
s1 -p1-> o1
s1 -p2-> o2
s1 -p2-> o3
s2 -p1-> s1
s2 ...

Output (adjacency list):
s1 (out: p1 o1) (out: p2 o2) (out: p2 o3) (in: s2 p1)
s2 ...

Whether it is better to do it this way, or to have support from the Giraph APIs that avoids an initial MapReduce job to construct the adjacency list, I do not know yet.

> Workflow for loading RDF graph data into Giraph
> -----------------------------------------------
>
>                 Key: GIRAPH-170
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-170
>             Project: Giraph
>          Issue Type: New Feature
>            Reporter: Dan Brickley
>            Priority: Minor
>
> W3C RDF provides a family of Web standards for exchanging graph-based data.
> RDF uses sets of simple binary relationships, labeling nodes and links with
> Web identifiers (URIs). Many public datasets are available as RDF, including
> the "Linked Data" cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many
> such datasets are listed at http://thedatahub.org/
> RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple
> line-oriented format is N-Triples. A format aligned with RDF's SPARQL query
> language is Turtle. Apache Jena and Any23 provide software to handle all
> these; http://incubator.apache.org/jena/ http://incubator.apache.org/any23/
> This JIRA leaves open the strategy for loading RDF data into Giraph. There
> are various possibilities, including exploitation of intermediate
> Hadoop-friendly stores, or pre-processing with e.g. Pig-based tools into a
> more Giraph-friendly form, or writing custom loaders.
> Even a HOWTO document
> or implementor notes here would be an advance on the current state of the
> art. The BluePrints Graph API (Gremlin etc.) has also been aligned with
> various RDF datasources.
> Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141
> touches on the issue (since we can't currently easily represent fully general
> RDF graphs, as two nodes might be connected by more than one typed edge).
> Even without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph; e.g. perhaps some app is only interested in, say, the Movies +
> People subset of a big RDF collection.
> From Avery in email: "a helper VertexInputFormat (and maybe
> VertexOutputFormat) would certainly [despite GIRAPH-141] still help"
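Paolo's one-job construction above can be sketched without Hadoop: for each triple s -p-> o, emit an outgoing entry under s and an incoming entry under o, then group by node. A minimal plain-Java sketch of that shuffle step (class and method names are illustrative, not Giraph or Hadoop API):

```java
import java.util.*;

public class AdjacencyBuilder {
    // For each triple s -p-> o, record (out: p o) under s and (in: s p)
    // under o, mirroring the map/shuffle phases of a single MapReduce job.
    public static Map<String, List<String>> build(List<String[]> triples) {
        Map<String, List<String>> adj = new TreeMap<>();
        for (String[] t : triples) {
            String s = t[0], p = t[1], o = t[2];
            adj.computeIfAbsent(s, k -> new ArrayList<>())
               .add("(out: " + p + " " + o + ")");
            adj.computeIfAbsent(o, k -> new ArrayList<>())
               .add("(in: " + s + " " + p + ")");
        }
        return adj;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
            new String[]{"s1", "p1", "o1"},
            new String[]{"s1", "p2", "o2"},
            new String[]{"s1", "p2", "o3"},
            new String[]{"s2", "p1", "s1"});
        for (Map.Entry<String, List<String>> e : build(triples).entrySet()) {
            System.out.println(e.getKey() + " " + String.join(" ", e.getValue()));
        }
        // s1's line matches the comment's example:
        // s1 (out: p1 o1) (out: p2 o2) (out: p2 o3) (in: s2 p1)
    }
}
```

In a real MapReduce job the two `add` calls become two map-side emits keyed by s and o respectively, and the reducer concatenates the grouped values.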
[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph
[ https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247506#comment-13247506 ]

Dan Brickley commented on GIRAPH-170:
-------------------------------------

Another architectural note around RDF: RDF is basically simple factual data expressed as sets of binary relationships. In that sense it is a graph directly, already. However, often RDF describes something that is, in a deeper sense, also a graph. Common examples include FOAF, where node and edge types (Person, Document, Group, etc.) can express a matrix of collaboration, social linkage, etc. Or, from DBpedia.org, Freebase, etc., we have for example datasets of movies and actors. In the DBpedia case, it's simple enough: a movie node, an actor node, and a typed link between them. Freebase, by contrast, reifies the 'starring' relationship into another node, so you can represent dates, character name, etc. This sort of meta-information (properties of links) is, by the way, also in the BluePrints/Gremlin API.

One point here is that a 'starring' link pointing from a Movie to an Actor tells us the same thing, but in reverse, as what we would have learned from a 'starsIn' link from the Actor to the Movie. For Giraph we may therefore want to consider adding backlinks, so each node is equally aware of properties pointing both in, and out.
Re: On helping new contributors pitch in quickly...
On 5 April 2012 17:05, Avery Ching wrote:
> Dan, you're definitely right that this has been mentioned a few times. The
> multigraph issue is one part of it, but a helper VertexInputFormat (and
> maybe VertexOutputFormat) would certainly still help as you mention. Can
> you please open a JIRA (and help if you have time)?

Here you go: https://issues.apache.org/jira/browse/GIRAPH-170

I've tried to summarise discussion from here and elsewhere.

Dan
[jira] [Created] (GIRAPH-170) Workflow for loading RDF graph data into Giraph
Workflow for loading RDF graph data into Giraph
-----------------------------------------------

                 Key: GIRAPH-170
                 URL: https://issues.apache.org/jira/browse/GIRAPH-170
             Project: Giraph
          Issue Type: New Feature
            Reporter: Dan Brickley
            Priority: Minor

W3C RDF provides a family of Web standards for exchanging graph-based data. RDF uses sets of simple binary relationships, labeling nodes and links with Web identifiers (URIs). Many public datasets are available as RDF, including the "Linked Data" cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many such datasets are listed at http://thedatahub.org/

RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple line-oriented format is N-Triples. A format aligned with RDF's SPARQL query language is Turtle. Apache Jena and Any23 provide software to handle all these; http://incubator.apache.org/jena/ http://incubator.apache.org/any23/

This JIRA leaves open the strategy for loading RDF data into Giraph. There are various possibilities, including exploitation of intermediate Hadoop-friendly stores, pre-processing with e.g. Pig-based tools into a more Giraph-friendly form, or writing custom loaders. Even a HOWTO document or implementor notes here would be an advance on the current state of the art. The BluePrints Graph API (Gremlin etc.) has also been aligned with various RDF datasources.

Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue (since we can't currently easily represent fully general RDF graphs, as two nodes might be connected by more than one typed edge). Even without multigraphs it ought to be possible to bring RDF-sourced data into Giraph; e.g. perhaps some app is only interested in, say, the Movies + People subset of a big RDF collection.

From Avery in email: "a helper VertexInputFormat (and maybe VertexOutputFormat) would certainly [despite GIRAPH-141] still help"
[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph
[ https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247490#comment-13247490 ]

Dan Brickley commented on GIRAPH-170:
-------------------------------------

From Paolo in email:
"""
I suspect N-Triples | N-Quads might not be the best option for something like Giraph. Something more like an adjacency list might be better. So, my intuition is that if you start with RDF in N-Triples format, the first step would be a simple MapReduce job to group RDF statements by subject (eventually filtering out certain properties):

Input:
s1 --p1--> o1
s1 --p2--> o2
s1 --p2--> o3
s2 ...

Output (adjacency list):
s1 (p1 o1) (p2 o2) (p2 o3)
s2 ...
"""
[jira] [Updated] (GIRAPH-168) Simplify munge directive usage with new munge flag HADOOP_SECURE (rather than HADOOP_FACEBOOK) and remove usage of HADOOP
[ https://issues.apache.org/jira/browse/GIRAPH-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koontz updated GIRAPH-168:
---------------------------------
    Attachment: GIRAPH-168.patch

- Update hadoop_trunk to look for Hadoop version 3.0.0-SNAPSHOT
- Change the HADOOP_OLDRPC munge flag to the more descriptive HADOOP_NON_SASL_RPC

> Simplify munge directive usage with new munge flag HADOOP_SECURE (rather than
> HADOOP_FACEBOOK) and remove usage of HADOOP
> -----------------------------------------------------------------------------
>
>                 Key: GIRAPH-168
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-168
>             Project: Giraph
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Eugene Koontz
>            Assignee: Eugene Koontz
>         Attachments: GIRAPH-168.patch, GIRAPH-168.patch, GIRAPH-168.patch, GIRAPH-168.patch
>
> This JIRA relates to the mail thread here:
> http://mail-archives.apache.org/mod_mbox/incubator-giraph-dev/201203.mbox/browser
> Currently we check for the munge flags HADOOP, HADOOP_FACEBOOK and
> HADOOP_NON_SECURE when using munge in a few places. Hopefully we can
> eliminate usage of munge in the future, but until then, we can mitigate the
> complexity by consolidating the number of flags checked. This JIRA renames
> HADOOP_FACEBOOK to HADOOP_SECURE, and removes usages of HADOOP, to handle the
> same conditional compilation requirements. It also makes it easier to add
> more maven profiles, so that we can easily increase our hadoop version
> coverage.
> This patch modifies the existing hadoop_facebook profile to use the new
> HADOOP_SECURE munge flag, rather than HADOOP_FACEBOOK.
> It also adds a new hadoop maven profile, hadoop_trunk, which also sets
> HADOOP_SECURE.
> Finally, it adds a default profile, hadoop_0.20.203. This is needed so that
> we can specify its dependencies separately from hadoop_trunk, because the
> hadoop dependencies have changed between trunk and 0.205.0 - the former
> requires hadoop-common, hadoop-mapreduce-client-core, and
> hadoop-mapreduce-client-common, whereas the latter requires hadoop-core.
> With this patch, the following passes:
> {code}
> mvn clean verify && mvn -Phadoop_trunk clean verify && mvn -Phadoop_0.20.203 clean verify
> {code}
> Current problems:
> * I left in place the usage of HADOOP_NON_SECURE, but note that the profile
> that uses this is hadoop_non_secure, which fails to compile on trunk:
> https://issues.apache.org/jira/browse/GIRAPH-167
> * I couldn't get -Phadoop_facebook to work; does this work outside of
> Facebook?
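For readers unfamiliar with munge: it is a comment-based preprocessor that lets one Java source tree compile against several Hadoop versions, with each maven profile passing a flag that selects which guarded branch is un-commented before compilation. A hypothetical fragment (the connection object and method names are illustrative, not taken from the Giraph source; the if/else/end directive shape is munge's, but the exact flags used are the ones this JIRA introduces):

```java
/*if[HADOOP_SECURE]
// Emitted only when the maven profile passes HADOOP_SECURE to munge:
conn.saslConnect();   // secure Hadoop: SASL-authenticated RPC
else[HADOOP_SECURE]*/
// Emitted otherwise (the branch above stays commented out):
conn.plainConnect();  // non-secure Hadoop: plain RPC
/*end[HADOOP_SECURE]*/
```

Consolidating HADOOP, HADOOP_FACEBOOK and HADOOP_NON_SECURE down to fewer flags means fewer such guarded branches to keep in sync across profiles.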
[jira] [Commented] (GIRAPH-168) Simplify munge directive usage with new munge flag HADOOP_SECURE (rather than HADOOP_FACEBOOK) and remove usage of HADOOP
[ https://issues.apache.org/jira/browse/GIRAPH-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247396#comment-13247396 ]

Eugene Koontz commented on GIRAPH-168:
--------------------------------------

Hi Jakob, I wonder if HADOOP_NO_SASL might be better than HADOOP_OLDRPC (since the divergence in RPC has to do with HADOOP-6419, "Change RPC layer to support SASL based mutual authentication")?
Re: On helping new contributors pitch in quickly...
Here is a related JIRA: https://issues.apache.org/jira/browse/GIRAPH-155

Avery

On 4/5/12 9:45 AM, Paolo Castagna wrote:

Hi Dan,
I have not an answer to your questions/observations yet. However, I suspect N-Triples | N-Quads might not be the best option for something like Giraph. Something more like an adjacency list might be better. So, my intuition is that if you start with RDF in N-Triples format, the first step would be a simple MapReduce job to group RDF statements by subject (eventually filtering out certain properties):

Input:
s1 --p1--> o1
s1 --p2--> o2
s1 --p2--> o3
s2 ...

Output (adjacency list):
s1 (p1 o1) (p2 o2) (p2 o3)
s2 ...

But, as I said, it is too early for me to say definitively that this is the best approach.

Paolo

Dan Brickley wrote:

On 5 April 2012 05:49, Jakob Homan wrote:
> Ack!, I suck. Sorry. I hadn't realized we'd gone through most of them,
> which itself is a good thing. I'll get some new ones added first thing
> in the morning. Sorry.

Do we have something around "document a workflow to get RDF graph data into Giraph"? A few of us have been talking about it here or there, and I've heard various strategies mentioned (e.g. N-Triples as it's a simple line-oriented format; piggybacking on HBase or other storage that Giraph already has adaptors for; integrating Apache Jena; ...). I can't find much in JIRA, but https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue (since we can't currently easily represent fully general RDF graphs, as two nodes might be connected by more than one typed edge). Even without multigraphs it ought to be possible to bring RDF-sourced data into Giraph; e.g. perhaps some app is only interested in, say, the Movies + People subset of a big RDF collection. And so perhaps most of the work is in preprocessing for now - e.g. via N-Triples + Pig; but still, it would be great to have a clear HOWTO.

As an interested party on the periphery, a JIRA for this would give a natural place to monitor, read up, maybe even help. And I'm sure I'm not alone...

cheers,

Dan
Re: On helping new contributors pitch in quickly...
Hi Dan,
I have not an answer to your questions/observations yet. However, I suspect N-Triples | N-Quads might not be the best option for something like Giraph. Something more like an adjacency list might be better. So, my intuition is that if you start with RDF in N-Triples format, the first step would be a simple MapReduce job to group RDF statements by subject (eventually filtering out certain properties):

Input:
s1 --p1--> o1
s1 --p2--> o2
s1 --p2--> o3
s2 ...

Output (adjacency list):
s1 (p1 o1) (p2 o2) (p2 o3)
s2 ...

But, as I said, it is too early for me to say definitively that this is the best approach.

Paolo

Dan Brickley wrote:
> On 5 April 2012 05:49, Jakob Homan wrote:
>> Ack!, I suck. Sorry. I hadn't realized we'd gone through most of
>> them, which itself is a good thing. I'll get some new ones added
>> first thing in the morning. Sorry.
>
> Do we have something around "document a workflow to get RDF graph data
> into Giraph"? A few of us have been talking about it here or there,
> and I've heard various strategies mentioned (e.g. N-Triples as it's a
> simple line-oriented format; piggybacking on HBase or other storage
> that Giraph already has adaptors for; integrating Apache Jena; ...). I
> can't find much in JIRA, but
> https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
> (since we can't currently easily represent fully general RDF graphs,
> as two nodes might be connected by more than one typed edge). Even
> without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph; e.g. perhaps some app is only interested in, say, the
> Movies + People subset of a big RDF collection. And so perhaps most of
> the work is in preprocessing for now - e.g. via N-Triples + Pig; but
> still, it would be great to have a clear HOWTO.
>
> As an interested party on the periphery, a JIRA for this would give a
> natural place to monitor, read up, maybe even help. And I'm sure I'm
> not alone...
>
> cheers,
>
> Dan
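The group-by-subject step Paolo describes can be sketched in plain Java, standing in for the map and reduce phases of his one MapReduce job (class and method names are illustrative; a real pipeline would use a proper N-Triples parser, e.g. from Apache Jena, rather than whitespace splitting):

```java
import java.util.*;

public class GroupBySubject {
    // Group "s p o ." style lines by subject into Paolo's adjacency-list
    // form. This sketch just splits on whitespace; a real N-Triples parser
    // would also handle literals, escaping and comments.
    public static Map<String, StringBuilder> group(List<String> lines) {
        Map<String, StringBuilder> bySubject = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 3) continue;  // skip malformed lines
            String s = parts[0], p = parts[1], o = parts[2];
            bySubject.computeIfAbsent(s, k -> new StringBuilder(s))
                     .append(" (").append(p).append(' ').append(o).append(')');
        }
        return bySubject;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList(
            "s1 p1 o1 .", "s1 p2 o2 .", "s1 p2 o3 .", "s2 p1 o1 .");
        group(in).values().forEach(System.out::println);
        // s1 (p1 o1) (p2 o2) (p2 o3)
        // s2 (p1 o1)
    }
}
```

In MapReduce terms, the split is the mapper (emit subject as key, "(p o)" as value) and the `computeIfAbsent`/`append` grouping is the reducer.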
Re: On helping new contributors pitch in quickly...
Dan, you're definitely right that this has been mentioned a few times. The multigraph issue is one part of it, but a helper VertexInputFormat (and maybe VertexOutputFormat) would certainly still help, as you mention. Can you please open a JIRA (and help if you have time)?

Avery

On 4/5/12 1:49 AM, Dan Brickley wrote:
> On 5 April 2012 05:49, Jakob Homan wrote:
>> Ack!, I suck. Sorry. I hadn't realized we'd gone through most of
>> them, which itself is a good thing. I'll get some new ones added
>> first thing in the morning. Sorry.
>
> Do we have something around "document a workflow to get RDF graph data
> into Giraph"? A few of us have been talking about it here or there, and
> I've heard various strategies mentioned (e.g. N-Triples as it's a simple
> line-oriented format; piggybacking on HBase or other storage that Giraph
> already has adaptors for; integrating Apache Jena; ...). I can't find
> much in JIRA, but https://issues.apache.org/jira/browse/GIRAPH-141
> touches on the issue (since we can't currently easily represent fully
> general RDF graphs, as two nodes might be connected by more than one
> typed edge). Even without multigraphs it ought to be possible to bring
> RDF-sourced data into Giraph; e.g. perhaps some app is only interested
> in, say, the Movies + People subset of a big RDF collection. And so
> perhaps most of the work is in preprocessing for now - e.g. via
> N-Triples + Pig; but still, it would be great to have a clear HOWTO.
>
> As an interested party on the periphery, a JIRA for this would give a
> natural place to monitor, read up, maybe even help. And I'm sure I'm not
> alone...
>
> cheers,
> Dan
Re: On helping new contributors pitch in quickly...
On 5 April 2012 05:49, Jakob Homan wrote:
> Ack!, I suck. Sorry. I hadn't realized we'd gone through most of
> them, which itself is a good thing. I'll get some new ones added
> first thing in the morning. Sorry.

Do we have something around "document a workflow to get RDF graph data into Giraph"? A few of us have been talking about it here or there, and I've heard various strategies mentioned (e.g. N-Triples as it's a simple line-oriented format; piggybacking on HBase or other storage that Giraph already has adaptors for; integrating Apache Jena; ...). I can't find much in JIRA, but https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue (since we can't currently easily represent fully general RDF graphs, as two nodes might be connected by more than one typed edge). Even without multigraphs it ought to be possible to bring RDF-sourced data into Giraph; e.g. perhaps some app is only interested in, say, the Movies + People subset of a big RDF collection. And so perhaps most of the work is in preprocessing for now - e.g. via N-Triples + Pig; but still, it would be great to have a clear HOWTO.

As an interested party on the periphery, a JIRA for this would give a natural place to monitor, read up, maybe even help. And I'm sure I'm not alone...

cheers,

Dan
Re: Giraph as Whirr service, see WHIRR-530
Thank you all for your comments. There seems to be some interest, and certainly agreement on just "for testing"/"temporary" and the limits of cloud infrastructure in relation to things such as Hadoop, ZooKeeper and Giraph.

I also agree that, given Whirr can already spin up Hadoop clusters, users can run Giraph that way. The Whirr option might become more interesting in relation to YARN, and perhaps unit/integration testing (although I am not sure if/who is willing to put a credit card behind that). Fortunately, Giraph tests run reasonably well and quickly locally.

Anyway, I'll keep an eye on WHIRR-530 and, as I learn more about Giraph and Whirr, help with that if I can.

Personally, I am more interested in YARN and Giraph than Giraph in its current shape. Or, in other words, in the future of Giraph rather than in the past (i.e. backward compatibility/legacy) (although I am aware you have that in mind as well, and it seems to me there are already Giraph users, so...).

Thanks,
Paolo

Paolo Castagna wrote:
> Hi,
> seen this?
>
> WHIRR-530 - Add Giraph as a service
> https://issues.apache.org/jira/browse/WHIRR-530
>
> This could be quite useful for users who want to give Giraph a spin on cloud
> infrastructure, just for testing or to run a few small experiments.
> My experience with Whirr on small 10-20 node clusters has been quite positive.
> Less so for larger clusters, but that is more a problem/limit with the cloud
> provider rather than Whirr itself, I think.
>
> Whirr makes it extremely easy and pleasant to deploy stuff on-demand.
>
> ... and Whirr already supports YARN:
> https://issues.apache.org/jira/browse/WHIRR-391
>
> Are any Giraph developers/users here also Whirr users?
>
> Paolo
Re: Giraph as Whirr service, see WHIRR-530
Having used Whirr several times in EC2, it seems like a fine way to spin up a temporary 'developers' cluster. ZooKeeper is the most likely source of difficulty on VMs with limited I/O (i.e., it's very chatty and doesn't tolerate the highly variable latency that smaller AMIs provide). The HBase community seems to be very aware of this; there are likely some tips and tricks to be gleaned from reading their mailing lists.

-Dan

On Wed, Apr 4, 2012 at 11:08 PM, Brian Femiano wrote:
> I've used it on clusters I started on EC2 launched by Whirr. Simply copy
> the fat jar to your client machine and it will distribute normally as a
> M/R dependency.
>
> It works very well.
>
> The only limitation I could potentially find (without much proof) was on
> VMs with limited IO, where the RPC message overhead between workers could
> be an issue. I never tried it on VMs with less than 'High' IO, so take
> that with a grain of salt.
>
> On Thu, Apr 5, 2012 at 12:51 AM, Jakob Homan wrote:
> > This is interesting. Whirr can already spin up Hadoop MR clusters,
> > which can then run the Giraph jobs. Once Giraph is bootstrapped onto
> > YARN, this will make more sense as a Whirr service.
> >
> > On Wed, Apr 4, 2012 at 9:43 PM, Avery Ching wrote:
> > > I don't use Whirr... I haven't heard it mentioned on this forum yet.
> > > Anyone?
> > >
> > > Avery
> > >
> > > On 4/4/12 9:30 PM, Paolo Castagna wrote:
> > >> Hi,
> > >> seen this?
> > >>
> > >> WHIRR-530 - Add Giraph as a service
> > >> https://issues.apache.org/jira/browse/WHIRR-530
> > >>
> > >> This could be quite useful for users who want to give Giraph a spin
> > >> on cloud infrastructure, just for testing or to run a few small
> > >> experiments. My experience with Whirr on small 10-20 node clusters
> > >> has been quite positive. Less so for larger clusters, but that is
> > >> more a problem/limit with the cloud provider rather than Whirr
> > >> itself, I think.
> > >>
> > >> Whirr makes it extremely easy and pleasant to deploy stuff on-demand.
> > >>
> > >> ... and Whirr already supports YARN:
> > >> https://issues.apache.org/jira/browse/WHIRR-391
> > >>
> > >> Are any Giraph developers/users here also Whirr users?
> > >>
> > >> Paolo

--
Daniel McClary, Ph.D.
Visiting Scholar, Amaral Lab and Department of Chemical and Biological Engineering, Northwestern University
Bioinformatics Specialist II, Howard Hughes Medical Institute
Email: dan.mccl...@northwestern.edu
Phone: (847) 491-1234
Web: http://amaral-lab.org/people/mcclary/
Mailing address: 2145 Sheridan Rd, Room E-136, Northwestern University, Evanston, IL 60208