[
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249590#comment-13249590
]
Dan Brickley commented on GIRAPH-170:
-------------------------------------
Paolo (spelled right this time... sorry!), does Pig sound like an appropriate
tool for that sort of pre-processing? I thought I'd seen some graph
manipulation code somewhere that might do the N-Triples to adjacency list
conversion, but I can't find the link. The closest I've found is
http://thedatachef.blogspot.com/2011/05/structural-similarity-with-apache-pig.html
https://github.com/ogrisel/pignlproc also has some code for parsing N-Triples
from Pig, e.g.
https://github.com/ogrisel/pignlproc/blob/master/src/main/java/pignlproc/storage/UriUriNTriplesLoader.java
though (from a quick look) it doesn't seem to handle literal values.
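To make the target shape concrete, here is a rough sketch of the N-Triples to adjacency list step in plain Python rather than Pig (chosen only for brevity). The function names, the tab-separated output layout, and the example.org URIs are all made up for illustration; this is not what pignlproc or Giraph actually does. Like the loader linked above, it only keeps URI-to-URI triples, so literals and blank nodes are silently dropped, and the regex is far from a full N-Triples parser.

```python
import re
from collections import defaultdict

# Matches only triples whose subject, predicate and object are all URIs,
# e.g. <http://a> <http://p> <http://b> .
# Lines with literal objects or blank nodes do not match and are skipped.
URI_TRIPLE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def ntriples_to_adjacency(lines):
    """Group URI-to-URI triples by subject: {subject: [(predicate, object), ...]}."""
    adjacency = defaultdict(list)
    for line in lines:
        if not line.strip() or line.lstrip().startswith('#'):
            continue  # blank line or comment
        match = URI_TRIPLE.match(line)
        if match is None:
            continue  # literal or blank node: out of scope for this sketch
        subject, predicate, obj = match.groups()
        adjacency[subject].append((predicate, obj))
        adjacency.setdefault(obj, [])  # nodes with no outgoing edges still get an entry
    return dict(adjacency)


def format_adjacency(adjacency):
    """Yield one tab-separated line per vertex: id, then predicate/target pairs.

    The layout is a hypothetical one, not a format Giraph is known to read.
    """
    for vertex in sorted(adjacency):
        fields = [vertex]
        for predicate, target in adjacency[vertex]:
            fields.extend([predicate, target])
        yield '\t'.join(fields)


sample = [
    '<http://example.org/alice> <http://example.org/knows> <http://example.org/bob> .',
    '<http://example.org/alice> <http://example.org/worksWith> <http://example.org/bob> .',
    '<http://example.org/alice> <http://example.org/name> "Alice" .',
]

for row in format_adjacency(ntriples_to_adjacency(sample)):
    print(row)
```

Keeping the predicate next to each target means two differently typed edges between the same pair of nodes both survive the conversion; a loader for a graph model without multigraph support would have to merge or filter them.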
> Workflow for loading RDF graph data into Giraph
> -----------------------------------------------
>
> Key: GIRAPH-170
> URL: https://issues.apache.org/jira/browse/GIRAPH-170
> Project: Giraph
> Issue Type: New Feature
> Reporter: Dan Brickley
> Priority: Minor
>
> W3C RDF provides a family of Web standards for exchanging graph-based data.
> RDF uses sets of simple binary relationships, labeling nodes and links with
> Web identifiers (URIs). Many public datasets are available as RDF, including
> the "Linked Data" cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many
> such datasets are listed at http://thedatahub.org/
> RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple
> line-oriented format is N-Triples. A format aligned with RDF's SPARQL query
> language is Turtle. Apache Jena and Any23 provide software to handle all
> these; http://incubator.apache.org/jena/ http://incubator.apache.org/any23/
> This JIRA leaves open the strategy for loading RDF data into Giraph. There
> are various possibilities, including exploitation of intermediate
> Hadoop-friendly stores, or pre-processing with e.g. Pig-based tools into a
> more Giraph-friendly form, or writing custom loaders. Even a HOWTO document
> or implementor notes here would be an advance on the current state of the
> art. The BluePrints Graph API (Gremlin etc.) has also been aligned with
> various RDF datasources.
> Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141
> touches on the issue (we can't currently easily represent fully general
> RDF graphs, since two nodes might be connected by more than one typed edge).
> Even without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in say the Movies +
> People subset of a big RDF collection.
> From Avery in email: "a helper VertexInputFormat (and maybe
> VertexOutputFormat) would certainly [despite GIRAPH-141] still help"