Thanks for your input.  Responses inline.


On 3/13/12 7:14 AM, Dionysios Logothetis wrote:
Hi all,
I'm a new Giraph user, and I'm facing a similar situation. My input graph is basically in the form of edges defined simply as a source and destination pair (optionally there could be an edge value). And these edges might be distributed across multiple files (this is actually a format I've seen in several graph data sets).

Without having looked at the internals of Giraph, I originally imagined that creating a MutableVertex and calling addVertexRequest for both vertices in an edge and addEdgeRequest from within the VertexReader would do the trick.

I agree that this idea can work. We would also need a default vertex value in case folks add edges to a vertex index only.
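
To make the idea concrete, here is a minimal stand-alone sketch of an edge-centric load that joins edges by vertex id and gives a default value to any vertex that only ever shows up inside an edge. It does not use the Giraph API: EdgeListLoader, SketchVertex, DEFAULT_VERTEX_VALUE, and the default edge value of 1.0 are all names and choices assumed for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone sketch; none of these names are Giraph API. */
final class EdgeListLoader {
    /** Hypothetical default value for vertices that only ever appear in edges. */
    static final double DEFAULT_VERTEX_VALUE = 0.0;

    /** Minimal in-memory vertex: a value plus outgoing edges (target id -> edge value). */
    static final class SketchVertex {
        double value = DEFAULT_VERTEX_VALUE;
        final Map<String, Double> outEdges = new LinkedHashMap<String, Double>();
    }

    /** Returns the vertex for id, creating it with the default value on first sight. */
    static SketchVertex getOrCreate(Map<String, SketchVertex> graph, String id) {
        SketchVertex vertex = graph.get(id);
        if (vertex == null) {
            vertex = new SketchVertex();
            graph.put(id, vertex);
        }
        return vertex;
    }

    /** Parses "source destination [edgeValue]" lines and joins them by vertex id. */
    static Map<String, SketchVertex> load(List<String> lines) {
        Map<String, SketchVertex> graph = new LinkedHashMap<String, SketchVertex>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length < 2) {
                throw new IllegalArgumentException("Bad edge line: " + line);
            }
            // Assumed default edge value when the optional third column is absent.
            double edgeValue = parts.length > 2 ? Double.parseDouble(parts[2]) : 1.0;
            SketchVertex source = getOrCreate(graph, parts[0]);
            getOrCreate(graph, parts[1]); // destination exists even with no out-edges
            source.outEdges.put(parts[1], edgeValue);
        }
        return graph;
    }

    public static void main(String[] args) {
        Map<String, SketchVertex> graph = load(Arrays.asList("a b", "b c 2.5", "a c"));
        for (Map.Entry<String, SketchVertex> entry : graph.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().outEdges);
        }
    }
}
```

In a distributed load the map would be partitioned by vertex id, so the get-or-create step is where a vertex resolver would have to merge requests coming from different workers.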

Now, this doesn't really work since there needs to be a graph state created in advance. The graph state is not created until all vertices have been loaded.
I wouldn't worry about graph state here since it's the input superstep. We can set it for all vertices after creation if need be.

There's also another implication with potentially multiple workers trying to create the same vertex, but I think a vertex resolver can handle this, assuming the resolver is instantiated before the vertices are loaded.


Is there a workaround to do this currently apart from pre-processing the graph?

Not currently. Can you please open a JIRA to track this issue? I think we should do it.

Do you think it would be useful to have such functionality?


I think it makes sense to handle graph mutations either at the very beginning or during execution in a uniform way. By the way, I'd be interested in contributing to the project.

We'd love to have your contributions, it's a great fit. =)

Looking forward to your response!


On Mon, Mar 12, 2012 at 9:09 PM, Avery Ching wrote:


    By the way, you're not the first to ask for a feature of this
    kind.  Perhaps we should consider an alternative format for
    loading input vertex data that is based on the edges or data of
    the vertices rather than totally vertex-centric.  We could load an
    edge, or a vertex value and join them all based on the vertex id.
     Handling conflicts could be a little difficult, but perhaps the
    vertex resolver could handle this as well.


    On 3/12/12 12:41 PM, Benjamin Heitmann wrote:

        On 12 Mar 2012, at 18:15, David Garcia wrote:

            Not sure what you're asking about.  getCurrentVertex()
            should only ever
            create one vertex.  Presumably it returns this vertex to
            the calling
            function. . .which is called in loadVertices() I think.

        Thanks David.

        I am asking this question because I have a text input format
        which is very different from a node adjacency list.
        The most important difference is that each line of the input
        file describes two nodes.
        The other important difference is that a node might be
        described on more than one line of the input.

        I have multiple gigabytes of input, so it would be very
        beneficial to directly load the input into Giraph.
        Otherwise the overhead of converting the input to some sort of
        node adjacency list is so big,
        that it might be a show-stopper regarding the suitability of

        For more details, here is the text from my previous email:
        =========================[snip]===========

        I am wondering if it would be possible to parse RDF input
        files from a TextInputFormat class.

        The most suitable text format for RDF is called "NTriples",
        and it has this very simple format:

        subject1 predicate1 object1 .\n
        subject1 predicate2 object2 .\n

        So each line contains the subject, which is a vertex, a
        predicate, which is a typed edge, and the object, which is
        another vertex.
        Then the line is terminated by a dot and a new-line.

        In Giraph terms, the result of parsing the first line would be
        the creation of a vertex for subject1 with an edge of type
        and then the creation of a second vertex for object1. So two
        vertices need to be created for that one line.

        Now the second line contains more information about the vertex
        So in Giraph terms, the vertex which was created for subject1
        needs to be retrieved/revisited and an edge of type predicate2,
        which points to the new vertex object2 needs to be created.
        And vertex object2 needs to be created.

        Just to point it out, such RDF NTriples files are unsorted, so
        information about the same vertex might appear e.g. at the
        first and at the last line
        of a multi-gigabyte file.
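
        The behaviour described above can be sketched without Giraph as a
        map keyed by vertex id that is revisited each time a subject
        reappears, however far apart the lines are. This is a simplified
        illustration only: TripleLoader, TypedEdge, parseTriple, and edgesOf
        are hypothetical names, and the parser ignores comments, escapes,
        and the other details of real NTriples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone sketch; none of these names are Giraph API. */
final class TripleLoader {

    /** One outgoing typed edge: the predicate and the target vertex id. */
    static final class TypedEdge {
        final String predicate;
        final String target;

        TypedEdge(String predicate, String target) {
            this.predicate = predicate;
            this.target = target;
        }
    }

    /** Splits "subject predicate object ." into its three parts (simplified). */
    static String[] parseTriple(String line) {
        String trimmed = line.trim();
        if (!trimmed.endsWith(".")) {
            throw new IllegalArgumentException("Missing terminating dot: " + line);
        }
        String body = trimmed.substring(0, trimmed.length() - 1).trim();
        // Limit of 3 keeps an object that contains spaces in one piece.
        String[] parts = body.split("\\s+", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected three terms: " + line);
        }
        return parts;
    }

    /** Returns the edge list for id, creating an empty one on first sight. */
    static List<TypedEdge> edgesOf(Map<String, List<TypedEdge>> graph, String id) {
        List<TypedEdge> edges = graph.get(id);
        if (edges == null) {
            edges = new ArrayList<TypedEdge>();
            graph.put(id, edges);
        }
        return edges;
    }

    /** Builds vertex id -> outgoing edges from unsorted triple lines. */
    static Map<String, List<TypedEdge>> load(List<String> lines) {
        Map<String, List<TypedEdge>> graph = new LinkedHashMap<String, List<TypedEdge>>();
        for (String line : lines) {
            String[] triple = parseTriple(line);
            // The subject vertex is created or revisited, then gains one edge.
            edgesOf(graph, triple[0]).add(new TypedEdge(triple[1], triple[2]));
            // The object vertex must exist too, even with no outgoing edges.
            edgesOf(graph, triple[2]);
        }
        return graph;
    }

    public static void main(String[] args) {
        Map<String, List<TypedEdge>> graph = load(Arrays.asList(
            "subject1 predicate1 object1 .",
            "subject9 predicate9 object9 .",
            "subject1 predicate2 object2 ."));
        for (Map.Entry<String, List<TypedEdge>> entry : graph.entrySet()) {
            System.out.println(entry.getKey() + " has "
                + entry.getValue().size() + " outgoing edge(s)");
        }
    }
}
```

        A single in-memory map like this only works on one machine; the
        open question is what plays the role of the map when the lines are
        split across workers.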

        Which interface can be used in a TextInputFormat/VertexReader
        in order to find an already created vertex ?

        Are there any other issues when
        VertexReader.getCurrentVertex() creates two vertices at the
        same time ?

        A second related question:
        If I have multiple formats for my input files, how would I
        implement that ?
        Just by adding a switch to the logic in getCurrentVertex() ?
        Or is there a better way to switch the input logic based on
        the file type ?
        All my input files would result in the same kind of Vertex
        being created.
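
        One alternative to a switch inside getCurrentVertex() could be a
        small lookup from file extension to a line parser, where every
        parser yields the same (source, label, target) shape. The sketch
        below is plain Java, not Giraph code: FormatSwitch, LineParser,
        TripleParser, EdgePairParser, the "edges" extension, and the
        "linksTo" label are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Stand-alone sketch; none of these names are Giraph API. */
final class FormatSwitch {

    /** Parses one input line into {source, edgeLabel, target}. */
    interface LineParser {
        String[] parse(String line);
    }

    /** Simplified "subject predicate object ." lines. */
    static final class TripleParser implements LineParser {
        public String[] parse(String line) {
            String trimmed = line.trim();
            if (!trimmed.endsWith(".")) {
                throw new IllegalArgumentException("Missing terminating dot: " + line);
            }
            String body = trimmed.substring(0, trimmed.length() - 1).trim();
            String[] parts = body.split("\\s+", 3);
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected three terms: " + line);
            }
            return parts;
        }
    }

    /** Plain "source target" lines; the edge label is a hypothetical default. */
    static final class EdgePairParser implements LineParser {
        public String[] parse(String line) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected two ids: " + line);
            }
            return new String[] {parts[0], "linksTo", parts[1]};
        }
    }

    /** Registry of parsers keyed by lower-case file extension. */
    static final Map<String, LineParser> PARSERS = new HashMap<String, LineParser>();
    static {
        PARSERS.put("nt", new TripleParser());
        PARSERS.put("edges", new EdgePairParser());
    }

    /** Picks the parser for a file name based on its extension. */
    static LineParser parserFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0
            ? ""
            : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        LineParser parser = PARSERS.get(extension);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for file: " + fileName);
        }
        return parser;
    }

    public static void main(String[] args) {
        String[] fromTriples = parserFor("dump.nt").parse("subject1 predicate1 object1 .");
        String[] fromPairs = parserFor("links.edges").parse("a b");
        System.out.println(String.join(" ", fromTriples));
        System.out.println(String.join(" ", fromPairs));
    }
}
```

        Keeping the per-format logic behind one interface means the code
        that builds vertices never needs to know which file type a line
        came from.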

        My motivation for doing this, in short:
        I have a large amount of RDF NTriples data which is provided
        by DBpedia. It amounts to somewhere between 5 GB and 20 GB,
        depending on which subset is used. Expressing this RDF data,
        so that each vertex is completely described in one text line,
        would require me to load it into an RDF store first, and then
        reprocess the data. In terms of RDF stores, that is already a
        non-trivial amount of data
        requiring quite a bit of hardware and tweaking. That is the
        reason why it would be valuable to directly load the RDF data
        into Giraph.
