Hi Matthias,
> 3) Redundancy: Most vendors will already have a BulkLoaderVertexProgram of
> some sort.
Yes. The more I think about it, the less I think we will be able to do:
https://issues.apache.org/jira/browse/TINKERPOP3-319
As you say, each vendor will be way too specific about how to deal with
transactions, id resolution, schemas, etc.
Next, I don't like the concept of "BulkLoader". I think TP3 should provide a
GraphMutator interface. This interface would provide lots of static "helpers"
for vendor implementations. Next, I/O would have something like this:
Class<? extends GraphMutator> graph.io().getGraphMutator()
From there, the messages of GraphMutator would be a Mutation class.
Mutation -- enum { ADD, DELETE, UPDATE }
…something something.
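
To make the idea concrete, here is a minimal sketch of what such an interface and its messages could look like. Everything in it is hypothetical: GraphMutator, Mutation, MutationType, apply() and the in-memory implementation are illustrative names only, not existing TinkerPop classes, and the real message format is still the "something something" above.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these types exist in TinkerPop.
final class GraphMutatorSketch {

    // The three mutation kinds named in the message.
    enum MutationType { ADD, DELETE, UPDATE }

    // One message handed to a GraphMutator: what to do, to which vertex, with which properties.
    static final class Mutation {
        final MutationType type;
        final Object vertexId;
        final Map<String, Object> properties;

        Mutation(MutationType type, Object vertexId, Map<String, Object> properties) {
            this.type = type;
            this.vertexId = vertexId;
            this.properties = properties;
        }
    }

    // Vendors would implement this with their own transaction and id-resolution logic.
    interface GraphMutator {
        void apply(Mutation mutation);
    }

    // Toy "vendor" implementation backed by a map of vertex id -> properties.
    static final class InMemoryGraphMutator implements GraphMutator {
        final Map<Object, Map<String, Object>> vertices = new LinkedHashMap<>();

        @Override
        public void apply(Mutation m) {
            switch (m.type) {
                case ADD:
                    vertices.put(m.vertexId, new HashMap<>(m.properties));
                    break;
                case UPDATE:
                    // Merge into the existing vertex, creating it if absent (getOrCreate style).
                    vertices.computeIfAbsent(m.vertexId, id -> new HashMap<>()).putAll(m.properties);
                    break;
                case DELETE:
                    vertices.remove(m.vertexId);
                    break;
            }
        }
    }

    public static void main(String[] args) {
        InMemoryGraphMutator mutator = new InMemoryGraphMutator();

        Map<String, Object> name = new HashMap<>();
        name.put("name", "marko");
        mutator.apply(new Mutation(MutationType.ADD, 1, name));

        Map<String, Object> rank = new HashMap<>();
        rank.put("pageRank", 0.15);
        mutator.apply(new Mutation(MutationType.UPDATE, 1, rank));

        System.out.println(mutator.vertices);
    }
}
```

The point of the sketch is only that the same message type covers both loading an empty graph (all ADD) and mutating an existing one (UPDATE/DELETE), while the vendor-specific parts stay behind apply().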
Now, this would support not only bulk loading of an empty graph, but also
mutating an existing graph. Currently, Gremlin OLAP has no way of doing
addV(),addE(),etc. with TraversalVertexProgram. However, if we assume
GraphMutator is a common class provided by vendors, we could have it so people
can then do this:
graph.compute(SparkGraphComputer.class).result(anotherGraph,somePredicateToDetermineWhichElementShouldBePersisted).program(PageRankVertexProgram).submit()
Now, GraphMutator is "special" in that once a VertexProgram finishes, and if
there is a specific result(), then GraphMutator takes over the next BSP
iteration. The GraphMutator instance is pulled via
anotherGraph.io().getGraphMutator().
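
A rough sketch of that hand-off, under loud assumptions: ComputedVertex, persist() and the map-backed "target graph" below are made up for illustration and stand in for the computed elements, the extra BSP iteration, and the vendor's GraphMutator respectively. No TinkerPop API is used.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Hypothetical sketch of the "mutator takes over after the program" flow; not TinkerPop API.
final class ResultPersistSketch {

    // Stand-in for a computed vertex: an id plus the properties the program produced.
    static final class ComputedVertex {
        final Object id;
        final Map<String, Object> properties;

        ComputedVertex(Object id, Map<String, Object> properties) {
            this.id = id;
            this.properties = properties;
        }
    }

    // Runs after the vertex program: only elements accepted by the predicate reach the writer.
    // Returns how many elements were handed to the writer.
    static int persist(Iterable<ComputedVertex> computed,
                       Predicate<ComputedVertex> shouldPersist,
                       BiConsumer<Object, Map<String, Object>> writer) {
        int written = 0;
        for (ComputedVertex v : computed) {
            if (shouldPersist.test(v)) {
                writer.accept(v.id, v.properties);
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        // The "other graph" is just a map here; a vendor would plug in its own mutator.
        Map<Object, Map<String, Object>> target = new LinkedHashMap<>();

        List<ComputedVertex> computed = Arrays.asList(
                new ComputedVertex(1, Collections.<String, Object>singletonMap("pageRank", 0.42)),
                new ComputedVertex(2, Collections.<String, Object>singletonMap("pageRank", 0.05)));

        // Persist only vertices whose rank exceeds an arbitrary threshold.
        int written = persist(computed,
                v -> ((Double) v.properties.get("pageRank")) > 0.1,
                target::put);

        System.out.println(written + " written: " + target);
    }
}
```

The predicate plays the role of somePredicateToDetermineWhichElementShouldBePersisted, and the writer is where a TitanGraphMutator or HadoopGraphMutator would sit.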
If you go Hadoop->Spark->Titan, Spark would pull the graph from Hadoop,
execute PageRankVertexProgram and then use TitanGraphMutator to write the
result properties.
If you go Titan->Fulgora->Titan, Fulgora would pull the graph from Titan,
execute PageRankVertexProgram and then use TitanGraphMutator to write the
result properties.
If you go Hadoop->Giraph->Hadoop, Giraph would pull the graph from Hadoop,
execute PageRankVertexProgram and then use HadoopGraphMutator to write the
result properties.
- In this situation, the second HadoopGraph would simply be a
properties file specifying the graph.location, i.e. the location to write the
sequence file to.
Thoughts?,
Marko.
http://markorodriguez.com
>
> Wouldn't it be better if we allowed chaining of VertexPrograms? In other
> words, run a PageRank first and then write the results into whatever target
> graph you want through a BulkLoaderVertexProgram that can be
> tuned/customized for a particular target graph?
>
> Then we could introduce optimizations, like "if you only want to persist
> properties then you only need one iteration of the BLVP and we run it
> together with the last iteration of the previous VP".
>
> Thoughts?
> Matthias
>
> On Mon, May 4, 2015 at 8:38 AM Marko Rodriguez <[email protected]> wrote:
>
>> Hi,
>>
>> On Friday, I was working with Dan LaRocque on getting
>> Spark/GiraphGraphComputers working over Titan. Luckily, it was pretty
>> trivial to do. However, a few quirks emerged around "the resultant graph"
>> that I think would be nice to solve.
>>
>> In GraphComputer, we have two enums: Persist{NOTHING, VERTEX_PROPERTIES,
>> EDGES} and ResultGraph{ORIGINAL,NEW}.
>>
>> I think we should modify this a bit to make cross vendor use of
>> GraphComputers cleaner. Moreover, I think we can get a lot of leverage
>> using the Attachable interface of I/O. See what you think:
>>
>>
>> graph.compute().program(PageRankVertexProgram).result(configuration,
>> (destinationGraph, attachableVertex) ->
>> Attachable.Method.create(destinationGraph)).submit()
>>
>> What does this mean?
>>
>> We should get rid of the GraphComputer.persist() and
>> GraphComputer.resultGraph() methods and replace them with a
>> GraphComputer.result() method. This method takes a Configuration which is
>> used to construct a Graph via GraphFactory.open(configuration). Thus, for
>> OLTP databases like Titan/Neo4j/etc., this would be a connection to the
>> database. For TinkerGraph, the configuration would have the graph object
>> in it (via setProperty()) or a completely "new TinkerGraph()." For HDFS
>> graphs, the configuration would be the directory of the place to write the
>> graph data (in essence, just a standard HadoopGraph properties file). When
>> we get ServerGraph implemented, this would just be a GremlinServer
>> connection. If no result() is provided, then ComputerResult.getGraph()
>> would just return EmptyGraph.instance(). So, now this generalizes the
>> ResultGraph.ORIGINAL/NEW situation, where a GraphComputer's resultant graph
>> can write to any Graph -- i.e., vendor agnostic. For instance, Titan's
>> FulgoraGraphComputer could, in principle, write its compute graph result
>> out to Neo4j.
>>
>> Next, the BiFunction provided is how to write the computed vertex to the
>> result graph. It would be great if this was all via Attachable, but then
>> that assumes all vendors are operating on Attachable vertices, which isn't
>> the case for TinkerGraph nor Titan. This is where Stephen and I would need
>> to think, but, in general, it's simply a "getOrCreate"-style method for
>> taking the vertex and writing it to the destination. Like Attachable, we
>> could have static methods for common use cases --
>> XXX.writeVertexProperties(), XXX.writeVertexProperties(Map<String,String>
>> propertyConverter), XXX.writeVertexPropertiesAndEdges(), etc. This way, it's
>> up to the end user to determine how they want the results to be handled and
>> again, it's vendor-agnostic (Graph -> Graph --- what those Graph instances
>> are, who cares).
>>
>> Thoughts?,
>> Marko.
>>
>> http://markorodriguez.com
>>
>>