Re: Question about TextInputFormat pattern for parsing e.g. RDF

Avery Ching Mon, 12 Mar 2012 13:04:53 -0700

Sorry for the delayed response.  Responses inline.

Avery


On 3/8/12 7:14 AM, Benjamin Heitmann wrote:

Hello again,

I am wondering if it would be possible to parse RDF input files from a 
TextInputFormat class.

The most suitable text format for RDF is called "NTriples", and it has this 
very simple format:

subject1 predicate1 object1 .\n
subject1 predicate2 object2 .\n
...

So each line contains the subject, which is a vertex, a predicate, which is a 
typed edge, and the object, which is another vertex.
Then the line is terminated by a dot and a new-line.

In Giraph terms, the result of parsing the first line would be the creation of 
a vertex for subject1 with an edge of type predicate1,
and then the creation of a second vertex for object1. So two vertices need to 
be created for that one line.

Now the second line contains more information about the vertex subject1.
So in Giraph terms, the vertex which was created for subject1 needs to be 
retrieved/revisited and an edge of type predicate2,
which points to the new vertex object2 needs to be created. And vertex object2 
needs to be created.

Just to point it out, such RDF NTriples files are unsorted, so information 
about the same vertex might appear e.g. on the first and on the last line
of a multi-gigabyte file.

Which interface can be used in a TextInputFormat/VertexReader in order to find 
an already created vertex?

Unfortunately, this is not possible. It's similar to the Hadoop InputFormat. Vertices (analogous to key-value pairs) are read one at a time. They are not saved for later access (just like Hadoop).

Are there any other issues when VertexReader.getCurrentVertex() creates two 
vertices at the same time ?


A second related question:
If I have multiple formats for my input files, how would I implement that ?
Just by adding a switch to the logic in getCurrentVertex() ? Or is there a 
better way to switch the input logic based on the file type ?
All my input files would result in the same kind of Vertex being created.


My motivation for doing this, in short:
I have a large amount of RDF NTriples data which is provided by DBPedia. It 
amounts to somewhere between 5 GB and 20 GB,
depending on which subset is used. Expressing this RDF data, so that each 
vertex is completely described in one text line,
would require me to load it into an RDF store first, and then reprocess the 
data. In terms of RDF stores, that is already a non-trivial amount of data
requiring quite a bit of hardware and tweaking. That is the reason why it would 
be valuable to directly load the RDF data into Giraph.


My suggestion would be the following:

Run an MR job to join all your RDF triples on the vertex key, then convert them to a format that is easy to parse with a custom VertexInputFormat of your choice. If these are one-way relationships, you need not create the target vertex. If they are undirected relationships, add a directed relationship to both vertices when you process your RDF triples in the MR job.
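
As a rough illustration of that join, here is a minimal in-memory sketch in plain Java. It is not a Hadoop job and not part of Giraph: the class name TripleJoiner, the methods parseTriple and joinBySubject, and the tab-separated one-line-per-vertex output are all assumptions made up for this example. In a real MR job the parsing would sit in the mapper (emitting the subject as the key) and the grouping would be done by the reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a Giraph or Hadoop class.
public class TripleJoiner {

    // Splits "subject predicate object ." into three terms.
    // Returns null for blank lines, comments, or lines without the final dot.
    static String[] parseTriple(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#") || !trimmed.endsWith(".")) {
            return null;
        }
        String body = trimmed.substring(0, trimmed.length() - 1).trim();
        // Limit of 3 keeps an object literal containing spaces in one piece.
        String[] parts = body.split("\\s+", 3);
        return parts.length == 3 ? parts : null;
    }

    // Groups all edges by vertex key, standing in for the shuffle/reduce step.
    // Output rows (assumed layout): vertex TAB predicate TAB target [TAB predicate TAB target ...]
    static List<String> joinBySubject(List<String> lines, boolean undirected) {
        // TreeMap only to make the output order deterministic.
        Map<String, List<String>> edgesByVertex = new TreeMap<>();
        for (String line : lines) {
            String[] t = parseTriple(line);
            if (t == null) {
                continue; // skip malformed input in this sketch
            }
            edgesByVertex.computeIfAbsent(t[0], k -> new ArrayList<>())
                         .add(t[1] + "\t" + t[2]);
            if (undirected) {
                // Undirected case: also record the reverse edge on the object vertex.
                edgesByVertex.computeIfAbsent(t[2], k -> new ArrayList<>())
                             .add(t[1] + "\t" + t[0]);
            }
        }
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : edgesByVertex.entrySet()) {
            rows.add(e.getKey() + "\t" + String.join("\t", e.getValue()));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "subject1 predicate1 object1 .",
            "subject2 predicate3 object1 .",
            "subject1 predicate2 object2 .");
        for (String row : joinBySubject(lines, false)) {
            System.out.println(row);
        }
    }
}
```

With undirected set to false, only subjects get a row, which matches the one-way case where the target vertex need not be created. Each row then describes one vertex completely, so a line-based VertexReader can build it without looking up earlier vertices.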

cheers, Benjamin.
