Dan, you're definitely right that this has been mentioned a few times.
The multigraph issue is one part of it, but a helper VertexInputFormat
(and maybe VertexOutputFormat) would certainly still help as you
mention. Can you please open a JIRA (and help if you have time)?
On 4/5/12 1:49 AM, Dan Brickley wrote:
On 5 April 2012 05:49, Jakob Homan <jgho...@gmail.com> wrote:
Ack! I suck. Sorry. I hadn't realized we'd gone through most of
them, which itself is a good thing. I'll get some new ones added
first thing in the morning. Sorry.
Do we have something around "document a workflow to get RDF graph data
into Giraph"? A few of us have been talking about it here or there,
and I've heard various strategies mentioned (e.g. N-Triples as it's a
simple line-oriented format; piggybacking on HBase or other storage
that Giraph already has adaptors for; integrating Apache Jena; ...). I
can't find much in JIRA but
https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
(we can't currently represent fully general RDF graphs easily, because
two nodes might be connected by more than one typed edge). Even
without multigraphs it ought to be possible to bring RDF-sourced data
into Giraph, e.g. perhaps some app is only interested in, say, the
Movies + People subset of a big RDF collection. And so perhaps most of
the work is in preprocessing for now - e.g. via N-Triples + Pig; but
still it would be great to have a clear HOWTO.
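
A minimal sketch of that preprocessing step, in Python rather than Pig,
might look like the following. It is only an illustration under stated
assumptions: the regex handles just the simplest N-Triples shape (IRI
subject, IRI predicate, IRI object), the tab-separated output layout is
hypothetical and not a format Giraph is known to read, and the function
names and the keep_predicates parameter are made up for this example.
Parallel typed edges between the same two nodes are collapsed into one
untyped edge, which sidesteps the multigraph limitation by losing the
edge types.

```python
import re
from collections import defaultdict

# Matches only the simplest N-Triples shape: <subject> <predicate> <object> .
# Literals, blank nodes, and comment lines do not match and are skipped
# (a deliberate simplification, not the full N-Triples grammar).
TRIPLE_RE = re.compile(r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def triples_to_adjacency(lines, keep_predicates=None):
    """Group IRI-to-IRI triples by subject, dropping the edge types.

    keep_predicates is a hypothetical filter: an optional set of predicate
    IRIs to keep, e.g. to select a Movies + People subset.
    """
    adjacency = defaultdict(list)
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match is None:
            continue
        subject, predicate, obj = match.groups()
        if keep_predicates is not None and predicate not in keep_predicates:
            continue
        # Collapse parallel typed edges into a single untyped edge.
        if obj not in adjacency[subject]:
            adjacency[subject].append(obj)
    return dict(adjacency)


def to_lines(adjacency):
    """Render one tab-separated line per vertex: the vertex, then its neighbours."""
    return ["\t".join([vertex] + targets)
            for vertex, targets in sorted(adjacency.items())]
```

The resulting lines could then be written to a text file for whatever
line-oriented input format is eventually chosen; a custom input format
would still be needed to load them into Giraph.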
As an interested party on the periphery, a JIRA for this would give a
natural place to monitor, read up, maybe even help. And I'm sure I'm