[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257695#comment-13257695
 ] 

Paolo Castagna commented on GIRAPH-170:
---

Hi Benjamin

> I call this the RDFAdjacencyCSV

We came to the same conclusion. I ended up using Turtle for this, as explained 
here: 
http://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201204.mbox/%3C4F84872E.4050101%40googlemail.com%3E

Turtle isn't splittable in general, but it can be made so simply by writing all 
the RDF statements with the same subject on a single line.
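
A minimal sketch of the one-subject-per-line idea, in plain Java with no RDF library. The class and method names are hypothetical, and it assumes every term is already serialized in Turtle syntax with absolute IRIs (no @prefix directives) and no multi-line literals, since either would make a line depend on other lines:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: groups triples by subject and writes one Turtle line per subject. */
class TurtlePerSubject {

    /** Each triple is {subject, predicate, object}, with terms already in Turtle syntax. */
    static List<String> toLines(List<String[]> triples) {
        // LinkedHashMap keeps subjects in first-seen order.
        Map<String, List<String>> bySubject = new LinkedHashMap<>();
        for (String[] t : triples) {
            bySubject.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t[1] + " " + t[2]);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : bySubject.entrySet()) {
            // "s p1 o1 ; p2 o2 ." is Turtle's predicate-object list syntax.
            lines.add(e.getKey() + " " + String.join(" ; ", e.getValue()) + " .");
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
            new String[] {"<http://example.org/s1>", "<http://example.org/p1>", "<http://example.org/o1>"},
            new String[] {"<http://example.org/s2>", "<http://example.org/p1>", "<http://example.org/s1>"},
            new String[] {"<http://example.org/s1>", "<http://example.org/p2>", "<http://example.org/o2>"});
        toLines(triples).forEach(System.out::println);
    }
}
```

Because each output line is a complete Turtle statement, a line-based input split can cut the file at any newline without breaking a statement.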

> I would like to say that Paolo's suggestion of providing some ready made code 
> for Pig, HBase and MapReduce for processing RDF sounds like a really great 
> contribution. 

I am not sure what the best place for such code is. I have started by sharing 
small examples and experiments on GitHub, here: 
https://github.com/castagna/jena-grande

> Integration of RDF reasoning capabilities: I will need to perform subclass 
> reasoning on the DBPedia graph.

See Apache Jena's RIOT infer command or a MapReduce version of it, here: 
https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/InferDriver.java

I wonder if Giraph could be used to implement the RETE algorithm 
(http://en.wikipedia.org/wiki/Rete_algorithm), which is what Jena uses (with 
in-memory Jena RDF models).

 Workflow for loading RDF graph data into Giraph
 ---

 Key: GIRAPH-170
 URL: https://issues.apache.org/jira/browse/GIRAPH-170
 Project: Giraph
  Issue Type: New Feature
Reporter: Dan Brickley
Priority: Minor

 W3C RDF provides a family of Web standards for exchanging graph-based data. 
 RDF uses sets of simple binary relationships, labeling nodes and links with 
 Web identifiers (URIs). Many public datasets are available as RDF, including 
 the Linked Data cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many 
 such datasets are listed at http://thedatahub.org/
 RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple 
 line-oriented format is N-Triples. A format aligned with RDF's SPARQL query 
 language is Turtle. Apache Jena and Any23 provide software to handle all 
 these; http://incubator.apache.org/jena/ http://incubator.apache.org/any23/
 This JIRA leaves open the strategy for loading RDF data into Giraph. There 
 are various possibilities, including exploitation of intermediate 
 Hadoop-friendly stores, or pre-processing with e.g. Pig-based tools into a 
 more Giraph-friendly form, or writing custom loaders. Even a HOWTO document 
 or implementor notes here would be an advance on the current state of the 
 art. The BluePrints Graph API (Gremlin etc.) has also been aligned with 
 various RDF datasources.
 Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141 
 touches on the issue (we can't currently easily represent fully general 
 RDF graphs, since two nodes might be connected by more than one typed edge). 
 Even without multigraphs it ought to be possible to bring RDF-sourced data
 into Giraph, e.g. perhaps some app is only interested in say the Movies + 
 People subset of a big RDF collection.
 From Avery in email: a helper VertexInputFormat (and maybe 
 VertexOutputFormat) would certainly [despite GIRAPH-141] still help

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-08 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249601#comment-13249601
 ] 

Paolo Castagna commented on GIRAPH-170:
---

Pig and Pig Latin can certainly be used to create adjacency lists from RDF in 
N-Triples or N-Quads format.
I tend to use plain MapReduce jobs written in Java instead, but I found a very 
old example (written against Pig version 0.6) of how one might write an 
[NQuadsStorage|https://github.com/castagna/running-pig/blob/e4d12b377ee06f80be7e58d2af628028df9b2b07/src/main/java/com/talis/pig/NQuadsStorage.java]
 which implements LoadFunc and StoreFunc for Pig. I am sharing it, even though 
it no longer compiles, just to show how trivial that is.

It is my intention, in the next few weeks, to create a small library to support 
people wanting to use Pig, HBase, MapReduce and Giraph to process RDF data.
For Pig, the first (and only?) thing to do is to implement LoadFunc and 
StoreFunc for RDF data. It seems possible (although not easy) to map the SPARQL 
algebra to Pig Latin physical operators (and SPARQL property paths to Giraph 
jobs? ;-)); that would provide a good and scalable batch processing solution 
for those into SPARQL. 
For HBase, the first step is to store RDF data; even a plain [(G)|S|P|O] 
solution would do initially.
For MapReduce, blank nodes can be painful; I have some tricks to share here, 
along with input/output formats, record readers/writers, etc.

In relation to Giraph, to bring the discussion back on topic: until I am proven 
wrong, I am going for the adjacency list approach as discussed above, and will 
do graph processing like any other 'usual' Giraph job.

The question of which RDF processing use cases are a good fit for Giraph is 
still open for me (and I'll find out soon).






[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-05 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13247542#comment-13247542
 ] 

Paolo Castagna commented on GIRAPH-170:
---

bq. we may want to consider therefore adding backlinks

Yep. I'd like to better understand what people currently do if they need 
incoming and outgoing links for their processing.
An adjacency list can be constructed listing incoming (a.k.a. backlinks) as 
well as outgoing links, in one MapReduce job.

Input:

s1 -p1- o1
s1 -p2- o2
s1 -p2- o3
s2 -p1- s1
s2 ...

Output (adjacency list):

s1 (out: p1 o1) (out: p2 o2) (out: p2 o3) (in: s2 p1)
s2 ...

Whether it is better to do it this way, or to have support in the Giraph APIs 
that avoids an initial MapReduce job to construct the adjacency list, I do not 
know yet.
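
The single-job construction above can be sketched in plain Java, simulating the map and reduce phases in memory rather than using the Hadoop API. All names here are hypothetical. The sketch assumes each triple is already parsed into subject, predicate and object strings, and it treats every object as a vertex (a real job would probably filter out literals):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory simulation of a MapReduce job that builds the adjacency list. */
class AdjacencyListSketch {

    /** "Map" phase: emit one record keyed by the subject and one keyed by the object. */
    static void map(String s, String p, String o, Map<String, List<String>> shuffle) {
        shuffle.computeIfAbsent(s, k -> new ArrayList<>()).add("(out: " + p + " " + o + ")");
        shuffle.computeIfAbsent(o, k -> new ArrayList<>()).add("(in: " + s + " " + p + ")");
    }

    /** "Reduce" phase: all records for one vertex become one adjacency list line. */
    static String reduce(String vertex, List<String> links) {
        return vertex + " " + String.join(" ", links);
    }

    static List<String> build(List<String[]> triples) {
        // The TreeMap stands in for the shuffle: it groups records by key and sorts the keys.
        Map<String, List<String>> shuffle = new TreeMap<>();
        for (String[] t : triples) {
            map(t[0], t[1], t[2], shuffle);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            lines.add(reduce(e.getKey(), e.getValue()));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
            new String[] {"s1", "p1", "o1"},
            new String[] {"s1", "p2", "o2"},
            new String[] {"s1", "p2", "o3"},
            new String[] {"s2", "p1", "s1"});
        build(triples).forEach(System.out::println);
    }
}
```

In this simulation the links for a vertex appear in input order. In a real MapReduce job the order of values reaching a reducer is not guaranteed, so a consumer of the adjacency list should not rely on it.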






[jira] [Commented] (GIRAPH-77) Coordinator should expose a web interface with progress, vertex region assignments, etc.

2012-04-04 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246581#comment-13246581
 ] 

Paolo Castagna commented on GIRAPH-77:
--

Hi Avery, I am still learning and stepping into the Apache Giraph source code 
(fortunately, it isn't that big) :-)
Do you or Jakob have a favorite stack to do that? Jetty/Netty?, JAX-RS?, etc. 
Any specific web framework and/or template engine? Something small, something 
to minimize dependencies, ... I tend to use Jetty with plain servlets and 
Velocity. But I am open to suggestions.

Ideally, we could/should publish JSON and render HTML pages client side (once 
again, I accept suggestions on JavaScript frameworks).
I must warn you though, I am not a web or graphic designer (and I know my 
limits on the UI front). But, once the basic functionality is in place and the 
correct data is available, I am sure some good web designer will fix that up.

Coming back to your question, with some guidance, yes. 
I would like to give it a shot and I have time to dedicate to Apache Giraph.

 Coordinator should expose a web interface with progress, vertex region 
 assignments, etc.
 

 Key: GIRAPH-77
 URL: https://issues.apache.org/jira/browse/GIRAPH-77
 Project: Giraph
  Issue Type: New Feature
Reporter: Jakob Homan

 It would be nice if the coordinator worker had a web interface that showed 
 progress, splits, etc. during job execution. Right now it would duplicate 
 information currently being exposed through task status, but with the move to 
 YARN, it will be a necessity.  It would be great if we could do this in a 
 modern way to avoid the screen-scraping, etc. currently used to get 
 information from most other Hadoop project's web interfaces.  The coordinator 
 could announce its address at the beginning or via status updates.





[jira] [Commented] (GIRAPH-77) Coordinator should expose a web interface with progress, vertex region assignments, etc.

2012-04-04 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246604#comment-13246604
 ] 

Paolo Castagna commented on GIRAPH-77:
--

Ok, I'll look at that tomorrow (our CTO likes Sinatra ;-)). At least Scala 
integrates seamlessly with Java (fingers crossed... and I need to double check 
dependencies and side effects on the Maven front). Where is your code? Have you 
already started on this?






[jira] [Commented] (GIRAPH-141) mulitgraph support in giraph

2012-04-03 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245145#comment-13245145
 ] 

Paolo Castagna commented on GIRAPH-141:
---

Just to add another example of a multigraph: the 
[RDF|http://en.wikipedia.org/wiki/Resource_Description_Framework] data model is 
a labelled directed multigraph.
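
To make "labelled directed multigraph" concrete, here is a small illustrative sketch. The class and its methods are hypothetical and are not part of the Giraph or Jena APIs; the point is only that two vertices may be connected by several edges that differ in their label (the RDF predicate):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal labelled directed multigraph stored as an edge list. */
class LabelledMultigraph {

    static final class Edge {
        final String source;
        final String label;
        final String target;

        Edge(String source, String label, String target) {
            this.source = source;
            this.label = label;
            this.target = target;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    /** Adds a directed edge; parallel edges between the same pair of vertices are allowed. */
    void addEdge(String source, String label, String target) {
        edges.add(new Edge(source, label, target));
    }

    /** Returns the labels of all edges going from source to target, in insertion order. */
    List<String> labelsBetween(String source, String target) {
        List<String> labels = new ArrayList<>();
        for (Edge e : edges) {
            if (e.source.equals(source) && e.target.equals(target)) {
                labels.add(e.label);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        LabelledMultigraph g = new LabelledMultigraph();
        g.addEdge("alice", "knows", "bob");
        g.addEdge("alice", "worksWith", "bob");
        System.out.println(g.labelsBetween("alice", "bob"));
    }
}
```

A vertex API that allows at most one edge per (source, target) pair would have to drop or merge one of the two edges in this example.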

 mulitgraph support in giraph
 

 Key: GIRAPH-141
 URL: https://issues.apache.org/jira/browse/GIRAPH-141
 Project: Giraph
  Issue Type: Improvement
  Components: graph
Reporter: André Kelpe

 The current vertex API only supports simple graphs, meaning that there can 
 only ever be one edge between two vertices. Many graphs like the road network 
 are in fact multigraphs, where many edges can connect two vertices at the 
 same time.
 Support for this could be added by introducing an Iterator<EdgeWritable> 
 getEdgeValue() or a similar construct. Maybe introducing a slim object like a 
 Connector between the edge and the vertex is also a good idea, so that you 
 could do something like:
 {code} 
 // Connector and getEdgeValues() are part of the proposal, not an existing API.
 for (final Connector<EdgeWritable, VertexWritable> conn : getEdgeValues()) {
  final EdgeWritable edge = conn.getEdge();
  final VertexWritable otherVertex = conn.getOther();
  doInterestingStuff(otherVertex);
  doMoreInterestingStuff(edge);
 }
 {code} 
