[jira] [Commented] (GIRAPH-168) Simplify munge directive usage with new munge flag HADOOP_SECURE (rather than HADOOP_FACEBOOK) and remove usage of HADOOP

2012-04-08 Thread Avery Ching (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249702#comment-13249702
 ] 

Avery Ching commented on GIRAPH-168:


Nice that you got it working with all the versions! One question, though: why 
is the line below needed in pom.xml?

giraph-0.2-SNAPSHOT-jar-with-dependencies.jar

> Simplify munge directive usage with new munge flag HADOOP_SECURE (rather than 
> HADOOP_FACEBOOK) and remove usage of HADOOP
> -
>
> Key: GIRAPH-168
> URL: https://issues.apache.org/jira/browse/GIRAPH-168
> Project: Giraph
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Eugene Koontz
> Assignee: Eugene Koontz
> Attachments: GIRAPH-168.patch, GIRAPH-168.patch, GIRAPH-168.patch, 
> GIRAPH-168.patch, GIRAPH-168.patch
>
>
> This JIRA relates to the mail thread here: 
> http://mail-archives.apache.org/mod_mbox/incubator-giraph-dev/201203.mbox/browser
> Currently we check for the munge flags HADOOP, HADOOP_FACEBOOK and 
> HADOOP_NON_SECURE when using munge in a few places. Hopefully we can 
> eliminate usage of munge in the future, but until then, we can mitigate the 
> complexity by consolidating the number of flags checked. This JIRA renames 
> HADOOP_FACEBOOK to HADOOP_SECURE, and removes usages of HADOOP, to handle the 
> same conditional compilation requirements. It also makes it easier to add 
> more maven profiles so that we can easily increase our hadoop version 
> coverage.
> This patch modifies the existing hadoop_facebook profile to use the new 
> HADOOP_SECURE munge flag, rather than HADOOP_FACEBOOK.
> It also adds a new hadoop maven profile, hadoop_trunk, which also sets 
> HADOOP_SECURE. 
> Finally, it adds a default profile, hadoop_0.20.203. This is needed so that 
> we can specify its dependencies separately from hadoop_trunk, because the 
> hadoop dependencies have changed between trunk and 0.205.0 - the former 
> requires hadoop-common, hadoop-mapreduce-client-core, and 
> hadoop-mapreduce-client-common, whereas the latter requires hadoop-core. 
> With this patch, the following passes:
> {code}
> mvn clean verify && mvn -Phadoop_trunk clean verify && mvn -Phadoop_0.20.203 clean verify
> {code}
> Current problems: 
> * I left in place the usage of HADOOP_NON_SECURE, but note that the profile 
> that uses this is hadoop_non_secure, which fails to compile on trunk: 
> https://issues.apache.org/jira/browse/GIRAPH-167 .
> * I couldn't get -Phadoop_facebook to work; does this work outside of 
> Facebook?
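
For readers unfamiliar with munge, the kind of conditional compilation being 
consolidated here can be sketched as below. This is only an illustration: it 
assumes munge's comment-style if/else/end directives, and the class name 
MungeSketch, the method securityMode and the returned strings are hypothetical 
and not taken from the patch. Compiled without preprocessing, the non-secure 
branch stays inside a comment and the secure branch is the live code.

```java
// Hypothetical sketch; MungeSketch and securityMode are not part of Giraph.
final class MungeSketch {
    static String securityMode() {
        // Without munge, everything from "if[" to "else[...]*/" is one comment,
        // so only the secure branch is compiled. Running munge with the
        // HADOOP_NON_SECURE flag defined is assumed to swap the two branches.
        /*if[HADOOP_NON_SECURE]
        return "non-secure";
        else[HADOOP_NON_SECURE]*/
        return "secure";
        /*end[HADOOP_NON_SECURE]*/
    }
}
```

Each extra flag multiplies the number of such blocks that must be kept in 
sync, which is why checking fewer flags reduces the maintenance burden.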

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (GIRAPH-85) Simplify return expression in RPCCommunications::getRPCProxy

2012-04-08 Thread Eli Reisman (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/GIRAPH-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Reisman updated GIRAPH-85:
--

Attachment: GIRAPH-85-3.patch

Re-uploading the GIRAPH-85-3 patch, this time remembering to set the 'Grant 
license' button to 'on'.


> Simplify return expression in RPCCommunications::getRPCProxy
> 
>
> Key: GIRAPH-85
> URL: https://issues.apache.org/jira/browse/GIRAPH-85
> Project: Giraph
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Jakob Homan
> Assignee: Eli Reisman
> Labels: newbie
> Fix For: 0.2.0
>
> Attachments: GIRAPH-85-3.patch, GIRAPH-85-3.patch, GIRAPH-85.patch, 
> GIRAPH-85.patch
>
>
> Twice in RPCCommunications::getRPCProxy a local variable, proxy, is created 
> and immediately returned.  We can simplify this to just return the value.
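
The simplification described above can be sketched as follows. This is a 
stand-in, not the actual RPCCommunications code: the class ReturnSketch, both 
method names and the Supplier parameter are assumptions made so the example 
compiles on its own.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the pattern in RPCCommunications::getRPCProxy.
final class ReturnSketch {
    // Before: a local variable is created only to be returned on the next line.
    static <T> T getProxyBefore(Supplier<T> factory) {
        T proxy = factory.get();
        return proxy;
    }

    // After: the expression is returned directly; behavior is unchanged.
    static <T> T getProxyAfter(Supplier<T> factory) {
        return factory.get();
    }
}
```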





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-08 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249601#comment-13249601
 ] 

Paolo Castagna commented on GIRAPH-170:
---

Pig and Pig Latin can certainly be used to create adjacency lists from RDF in 
N-Triples or N-Quads format.
I tend to use plain MapReduce jobs written in Java instead, but I found a very 
old example (it used Pig version 0.6) of how one might write an 
[NQuadsStorage|https://github.com/castagna/running-pig/blob/e4d12b377ee06f80be7e58d2af628028df9b2b07/src/main/java/com/talis/pig/NQuadsStorage.java]
 which implements LoadFunc and StoreFunc for Pig. I am sharing it, even though 
it no longer compiles, just to show how trivial this is.

It is my intention, in the next few weeks, to create a small library to support 
people wanting to use Pig, HBase, MapReduce and Giraph to process RDF data.
For Pig, the first (and only?) thing to do is to implement LoadFunc and 
StoreFunc for RDF data. It seems possible (although not easy) to map the SPARQL 
algebra to Pig Latin physical operators (and SPARQL property paths to Giraph 
jobs? ;-)), which would provide a good and scalable batch processing solution 
for those into SPARQL. 
For HBase, the first step is to store RDF data; even a plain [(G)|S|P|O] 
solution would do initially.
For MapReduce, blank nodes can be painful; I have some tricks to share here. 
The library would also cover input/output formats, record readers/writers, etc.

In relation to Giraph, to bring the discussion back on topic: until I am proven 
wrong, I am going with the adjacency list approach discussed above and will do 
the graph processing as with other 'usual' Giraph jobs.

The question of which RDF processing use cases are a good fit for Giraph is 
still open for me (and I'll find out soon).
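
The adjacency list approach mentioned above can be sketched in plain Java as 
follows. This is a minimal illustration under stated assumptions, not the 
planned library: the class NTriplesAdjacency and its methods are hypothetical, 
the parser only handles simple "<s> <p> <o> ." lines, it skips literals and 
blank nodes, it discards the predicate, and the tab-separated output layout is 
an assumption rather than a format Giraph prescribes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turns simple N-Triples lines into an adjacency list.
final class NTriplesAdjacency {
    // Groups object IRIs by subject IRI; lines that do not fit are skipped.
    static Map<String, List<String>> build(List<String> lines) {
        Map<String, List<String>> adjacency = new LinkedHashMap<>();
        for (String line : lines) {
            // Split into subject, predicate and "object ." (limit 3 keeps
            // literals containing spaces in one piece).
            String[] parts = line.trim().split("\\s+", 3);
            if (parts.length < 3) {
                continue;
            }
            String rest = parts[2].trim();
            if (!rest.endsWith(".")) {
                continue;
            }
            String subject = parts[0];
            String object = rest.substring(0, rest.length() - 1).trim();
            // Only IRI-to-IRI statements become edges in this sketch.
            if (!isIri(subject) || !isIri(object)) {
                continue;
            }
            adjacency.computeIfAbsent(strip(subject), k -> new ArrayList<>())
                     .add(strip(object));
        }
        return adjacency;
    }

    // Renders one "vertex<TAB>target<TAB>target..." line per source vertex.
    static List<String> toLines(Map<String, List<String>> adjacency) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : adjacency.entrySet()) {
            out.add(e.getKey() + "\t" + String.join("\t", e.getValue()));
        }
        return out;
    }

    private static boolean isIri(String term) {
        return term.length() > 2 && term.startsWith("<") && term.endsWith(">");
    }

    private static String strip(String iri) {
        return iri.substring(1, iri.length() - 1);
    }
}
```

Vertices that only ever appear as targets get no line of their own here; a 
real loader would need to decide how to handle them.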

> Workflow for loading RDF graph data into Giraph
> ---
>
> Key: GIRAPH-170
> URL: https://issues.apache.org/jira/browse/GIRAPH-170
> Project: Giraph
> Issue Type: New Feature
> Reporter: Dan Brickley
> Priority: Minor
>
> W3C RDF provides a family of Web standards for exchanging graph-based data. 
> RDF uses sets of simple binary relationships, labeling nodes and links with 
> Web identifiers (URIs). Many public datasets are available as RDF, including 
> the "Linked Data" cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many 
> such datasets are listed at http://thedatahub.org/
> RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple 
> line-oriented format is N-Triples. A format aligned with RDF's SPARQL query 
> language is Turtle. Apache Jena and Any23 provide software to handle all 
> these; http://incubator.apache.org/jena/ http://incubator.apache.org/any23/
> This JIRA leaves open the strategy for loading RDF data into Giraph. There 
> are various possibilities, including exploitation of intermediate 
> Hadoop-friendly stores, or pre-processing with e.g. Pig-based tools into a 
> more Giraph-friendly form, or writing custom loaders. Even a HOWTO document 
> or implementor notes here would be an advance on the current state of the 
> art. The BluePrints Graph API (Gremlin etc.) has also been aligned with 
> various RDF datasources.
> Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141 
> touches on the issue (we can't currently easily represent fully general RDF 
> graphs, since two nodes might be connected by more than one typed edge). 
> Even without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in, say, the Movies + 
> People subset of a big RDF collection.
> From Avery in email: "a helper VertexInputFormat (and maybe 
> VertexOutputFormat) would certainly [despite GIRAPH-141] still help"
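
To make the multigraph point concrete, here is one possible workaround, 
sketched under assumptions: instead of several parallel edges between two 
nodes, a single edge carries the set of predicates that link them. The class 
TypedEdgeSketch and its methods are hypothetical; nothing here is an existing 
Giraph API or a design proposed in GIRAPH-141.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: collapses parallel typed edges into one edge whose
// value is the set of predicates connecting the two nodes.
final class TypedEdgeSketch {
    // source -> target -> predicates linking source to target
    private final Map<String, Map<String, Set<String>>> edges = new LinkedHashMap<>();

    void add(String source, String predicate, String target) {
        edges.computeIfAbsent(source, k -> new LinkedHashMap<>())
             .computeIfAbsent(target, k -> new TreeSet<>())
             .add(predicate);
    }

    // Returns the predicates on the (single) edge, or an empty set if none.
    Set<String> predicates(String source, String target) {
        Map<String, Set<String>> targets = edges.get(source);
        if (targets == null) {
            return Collections.emptySet();
        }
        Set<String> found = targets.get(target);
        return found == null ? Collections.<String>emptySet() : found;
    }
}
```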





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-08 Thread Dan Brickley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249597#comment-13249597
 ] 

Dan Brickley commented on GIRAPH-170:
-

Ah, 
https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/storage/UriStringLiteralNTriplesStorer.java
 handles literals; my mistake. OK, so that will get N-Triples parsed into Pig. 
The remaining question is how to write them out again nicely for Giraph, or 
ingest them directly into Giraph.






[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-08 Thread Dan Brickley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249590#comment-13249590
 ] 

Dan Brickley commented on GIRAPH-170:
-

Paolo (spelled right this time... sorry!), does Pig sound like an appropriate 
tool for that sort of pre-processing? I thought I'd seen some graph 
manipulation code around somewhere that might do the N-Triples to adjacency 
list work, but I can't find the link. The closest I've found is 
http://thedatachef.blogspot.com/2011/05/structural-similarity-with-apache-pig.html
 

https://github.com/ogrisel/pignlproc also has some code for N-Triples parsing 
from Pig, e.g. 
https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/storage/UriUriNTriplesLoader.java
 though it doesn't (from a quick look) seem to handle literal values.

