Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?

Avery Ching Fri, 16 Mar 2012 00:07:35 -0700

If you found it useful, others might find it useful as well. Please feel free to add to a JIRA.


Avery


On 3/15/12 4:44 AM, Dionysis Logothetis wrote:

Ok, I've created an issue: https://issues.apache.org/jira/browse/GIRAPH-155

Feel free to edit if you think the description is not clear.

By the way, I have also created a vertex reader that reads adjacency lists but with no values for vertices and edges. That's also a format that I've seen in several graph data sets. The vertex reader is essentially a copy of the AdjacencyListVertexReader modified to handle this format. It's basically an abstract class and subclasses can override methods to provide default values for vertices and edges (otherwise values are initialized to null), just like Avery described below. If you think it's useful I can contribute this.
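
As a rough illustration of the format described above, here is a minimal Java sketch. It is not the actual reader and does not use the Giraph API: the class names ValuelessAdjacencyListParser, ParsedVertex and UnitWeightParser are hypothetical, and the whitespace-separated "source target1 target2 ..." line layout is an assumption. It only shows the idea of an abstract class whose subclasses override methods to supply default vertex and edge values, with null used otherwise.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a value-less adjacency-list reader; not a Giraph class. */
abstract class ValuelessAdjacencyListParser<V, E> {

    /** One parsed line: a vertex id, its value, and its outgoing edges. */
    public static final class ParsedVertex<V, E> {
        public final String id;
        public final V value;
        public final List<String> targets = new ArrayList<>();
        public final List<E> edgeValues = new ArrayList<>();

        ParsedVertex(String id, V value) {
            this.id = id;
            this.value = value;
        }
    }

    /** Subclasses may override to supply a default vertex value; null otherwise. */
    protected V defaultVertexValue(String vertexId) {
        return null;
    }

    /** Subclasses may override to supply a default edge value; null otherwise. */
    protected E defaultEdgeValue(String sourceId, String targetId) {
        return null;
    }

    /** Parses one line of the assumed form "source target1 target2 ...". */
    public final ParsedVertex<V, E> parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens[0].isEmpty()) {
            throw new IllegalArgumentException("Empty adjacency line");
        }
        ParsedVertex<V, E> vertex =
            new ParsedVertex<>(tokens[0], defaultVertexValue(tokens[0]));
        for (int i = 1; i < tokens.length; i++) {
            vertex.targets.add(tokens[i]);
            vertex.edgeValues.add(defaultEdgeValue(tokens[0], tokens[i]));
        }
        return vertex;
    }
}

/** Example subclass: every vertex starts at 0.0 and every edge has weight 1.0. */
final class UnitWeightParser extends ValuelessAdjacencyListParser<Double, Double> {
    @Override
    protected Double defaultVertexValue(String vertexId) {
        return 0.0;
    }

    @Override
    protected Double defaultEdgeValue(String sourceId, String targetId) {
        return 1.0;
    }
}
```

With this sketch, new UnitWeightParser().parseLine("a b c") yields vertex "a" with two outgoing edges of weight 1.0, while an anonymous subclass that overrides nothing leaves all values null.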

On Wed, Mar 14, 2012 at 7:39 AM, Avery Ching <ach...@apache.org<mailto:ach...@apache.org>> wrote:


    Thanks for your input.  Response inline.

    Avery


    On 3/13/12 7:14 AM, Dionysios Logothetis wrote:

    Hi all,
    I'm a new Giraph user, and I'm facing a similar situation. My
    input graph is basically in the form of edges defined simply as a
    source and destination pair (optionally there could be an edge
    value). And these edges might be distributed across multiple
    files (this is actually a format I've seen in several graph data
    sets).

    Without having looked at the internals of Giraph, I originally
    imagined that creating a MutableVertex and calling
    addVertexRequest for both vertices in an edge and addEdgeRequest
    from within the VertexReader would do the trick.

    I agree that this idea can work. We would also need a default
    vertex value in case folks add edges to a vertex index only.

    Now, this doesn't really work since there needs to be a graph
    state created in advance. The graph state is not created until
    all vertices have been loaded.

    I wouldn't worry about graph state here since it's the input
    superstep.  We can set it for all vertices after creation if need be.


    There's also another implication with
    potentially multiple workers trying to create the same vertex,
    but I think a vertex resolver can handle this, assuming the
    resolver is instantiated before the vertices are loaded.

    Yup.

    Is there a workaround to do this currently apart from
    pre-processing the graph?


    Not currently.  Can you please open a JIRA on
    https://issues.apache.org/jira/browse/GIRAPH to track this
    issue?  I think we should do it.

    Do you think it would be useful to have such functionality?


    Yes!

    I think it makes sense to handle graph mutations either at the
    very beginning or during an execution in a uniform way. By the
    way, I'd be interested in contributing to the project.


    We'd love to have your contributions, it's a great fit. =)


    Looking forward to your response!

    Thanks!


    On Mon, Mar 12, 2012 at 9:09 PM, Avery Ching <ach...@apache.org
    <mailto:ach...@apache.org>> wrote:

        Benjamin,

        By the way, you're not the first to ask for a feature of this
        kind.  Perhaps we should consider an alternative format for
        loading input vertex data that is based on the edges or data
        of the vertices rather than totally vertex-centric.  We could
        load an edge, or a vertex value and join them all based on
        the vertex id.  Handling conflicts could be a little
        difficult, but perhaps the vertex resolver could handle this
        as well.

        Avery
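
The edge-based loading idea above (load edges and vertex values separately, then join them on the vertex id) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: EdgeJoinSketch and VertexRecord are hypothetical names, nothing here is a Giraph class, and the conflict policy (keep the first vertex value seen) is an arbitrary stand-in for whatever a vertex resolver would actually decide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; none of these names are Giraph classes. */
final class EdgeJoinSketch {

    /** Everything known so far about one vertex id. */
    public static final class VertexRecord {
        public Double value; // null until a value record arrives
        public final List<String> targets = new ArrayList<>();
    }

    private final Map<String, VertexRecord> vertices = new LinkedHashMap<>();
    private final double defaultValue;

    public EdgeJoinSketch(double defaultValue) {
        this.defaultValue = defaultValue;
    }

    /** Looks up the record for an id, creating it on first sight. */
    private VertexRecord record(String id) {
        VertexRecord r = vertices.get(id);
        if (r == null) {
            r = new VertexRecord();
            vertices.put(id, r);
        }
        return r;
    }

    /** An edge record creates both endpoints if they are not known yet. */
    public void addEdge(String source, String target) {
        record(source).targets.add(target);
        record(target);
    }

    /** Assumed conflict policy: the first value seen for an id wins. */
    public void addVertexValue(String id, double value) {
        VertexRecord r = record(id);
        if (r.value == null) {
            r.value = value;
        }
    }

    /** Fills in the default value for vertices that only ever appeared in edges. */
    public Map<String, VertexRecord> resolve() {
        for (VertexRecord r : vertices.values()) {
            if (r.value == null) {
                r.value = defaultValue;
            }
        }
        return vertices;
    }
}
```

Records may arrive in any order and from any input file; the join happens purely on the vertex id, and vertices that were only mentioned in edges end up with the default value.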


        On 3/12/12 12:41 PM, Benjamin Heitmann wrote:

            On 12 Mar 2012, at 18:15, David Garcia wrote:

                Not sure what you're asking about.
                 getCurrentVertex() should only ever
                create one vertex.  Presumably it returns this vertex
                to the calling
                function. . .which is called in loadVertices() I think.

            Thanks David.

            I am asking this question because I have a text input
            format which is very different from a node adjacency list.
            The most important difference is that each line of the
            input file describes two nodes.
            The other important difference is that a node might be
            described on more than one line of the input.

            I have multiple gigabytes of input, so it would be very
            beneficial to directly load the input into Giraph.
            Otherwise the overhead of converting the input to some
            sort of node adjacency list is so big,
            that it might be a show-stopper regarding the suitability
            of Giraph.







            For more details, here is the text from my previous
            email:   =========================[snip]===========

            I am wondering if it would be possible to parse RDF input
            files from a TextInputFormat class.

            The most suitable text format for RDF is called
            "NTriples", and it has this very simple format:

            subject1 predicate1 object1 .\n
            subject1 predicate2 object2 .\n
            ...

            So each line contains the subject, which is a vertex, a
            predicate, which is a typed edge, and the object, which
            is another vertex.
            Then the line is terminated by a dot and a new-line.

            In Giraph terms, the result of parsing the first line
            would be the creation of a vertex for subject1 with an
            edge of type predicate1,
            and then the creation of a second vertex for object1. So
            two vertices need to be created for that one line.

            Now the second line contains more information about the
            vertex subject1.
            So in Giraph terms, the vertex which was created for
            subject1 needs to be retrieved/revisited and an edge of
            type predicate2,
            which points to the new vertex object2 needs to be
            created. And vertex object2 needs to be created.

            Just to point it out, such RDF NTriples files are
            unsorted, so information about the same vertex might
            appear e.g. at the first and at the last line
            of a multi-gigabyte file.
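
The per-line behaviour described above (each line yields two vertices and one typed edge, and a subject seen earlier must be revisited) might look roughly like the following Java sketch. It is a simplification under stated assumptions: NTriplesSketch is a hypothetical name, the parser only handles the simple "subject predicate object ." shape shown in this email rather than the full N-Triples grammar, and an in-memory map stands in for whatever lookup a Giraph VertexReader would need.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified illustration; not a complete N-Triples parser. */
final class NTriplesSketch {

    /** Returns {subject, predicate, object}, or null for blank lines and comments. */
    public static String[] parseTriple(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.startsWith("#")) {
            return null;
        }
        if (!s.endsWith(".")) {
            throw new IllegalArgumentException("Missing terminating dot: " + line);
        }
        s = s.substring(0, s.length() - 1).trim();
        // Limit of 3 keeps any whitespace inside the object in one piece.
        String[] parts = s.split("\\s+", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected three terms: " + line);
        }
        return parts;
    }

    /** Looks up the edge list of a vertex, creating the vertex on first sight. */
    private static List<String[]> edgesOf(Map<String, List<String[]>> graph, String id) {
        List<String[]> edges = graph.get(id);
        if (edges == null) {
            edges = new ArrayList<>();
            graph.put(id, edges);
        }
        return edges;
    }

    /** Builds vertex id -> list of {predicate, object} from unsorted lines. */
    public static Map<String, List<String[]>> load(Iterable<String> lines) {
        Map<String, List<String[]>> graph = new LinkedHashMap<>();
        for (String line : lines) {
            String[] t = parseTriple(line);
            if (t == null) {
                continue;
            }
            // The subject is revisited if it was created by an earlier line.
            edgesOf(graph, t[0]).add(new String[] {t[1], t[2]});
            // The object becomes a vertex too, even if it has no edges yet.
            edgesOf(graph, t[2]);
        }
        return graph;
    }
}
```

Because the lookup is keyed on the vertex id, the order of the lines does not matter. A single in-memory map obviously does not scale to a multi-gigabyte file split across workers, which is exactly why the question of how to find an already created vertex arises.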

            Which interface can be used in a
            TextInputFormat/VertexReader in order to find an already
            created vertex ?

            Are there any other issues when
            VertexReader.getCurrentVertex() creates two vertices at
            the same time ?


            A second related question:
            If I have multiple formats for my input files, how would
            I implement that ?
            Just by adding a switch to the logic in
            getCurrentVertex() ? Or is there a better way to switch
            the input logic based on the file type ?
            All my input files would result in the same kind of
            Vertex being created.


            My motivation for doing this, in short:
            I have a large amount of RDF NTriples data which is
            provided by DBPedia. It amounts to somewhere between 5 GB
            and 20 GB,
            depending on which subset is used. Expressing this RDF
            data, so that each vertex is completely described in one
            text line,
            would require me to load it into an RDF store first, and
            then reprocess the data. In terms of RDF stores, that is
            already a non-trivial amount of data
            requiring quite a bit of hardware and tweaking. That is
            the reason why it would be valuable to directly load the
            RDF data into Giraph.
