Hi Marko,

I think in principle this could work, but my reservations are:

1) Every GraphComputer implementation would have to implement a lot of redundant logic to write graph data into an arbitrary graph.

2) Can this be done efficiently at scale? A lot of tweaks and custom code are typically needed to efficiently write lots of graph data into a target graph database, in particular if you are thinking about large graph data sets like those generated by Hadoop, Giraph or Spark. I don't think it is reasonable to generalize the "fast bulk loading" logic into each and every GraphComputer implementation.

3) Redundancy: Most vendors will already have a BulkLoaderVertexProgram of some sort.

Wouldn't it be better if we allowed chaining of VertexPrograms? In other words, run a PageRank first and then write the results into whatever target graph you want through a BulkLoaderVertexProgram that can be tuned/customized for a particular target graph? Then we could introduce optimizations, like "if you only want to persist properties, then you only need one iteration of the BLVP, and we run it together with the last iteration of the previous VP".

Thoughts?
Matthias

On Mon, May 4, 2015 at 8:38 AM Marko Rodriguez <[email protected]> wrote:

> Hi,
>
> On Friday, I was working with Dan LaRocque on getting
> Spark/GiraphGraphComputers working over Titan. Luckily, it was pretty
> trivial to do. However, a few quirks emerged around "the resultant graph"
> that I think would be nice to solve.
>
> In GraphComputer, we have two enums: Persist{NOTHING, VERTEX_PROPERTIES,
> EDGES} and ResultGraph{ORIGINAL,NEW}.
>
> I think we should modify this a bit to make cross vendor use of
> GraphComputers cleaner. Moreover, I think we can get a lot of leverage
> using the Attachable interface of I/O. See what you think:
>
>
> graph.compute().program(PageRankVertexProgram).result(configuration,
> (destinationGraph, attachableVertex) ->
> Attachable.Method.create(destinationGraph)).submit()
>
> What does this mean?
>
> We should get rid of the GraphComputer.persist() and
> GraphComputer.resultGraph() methods and replace them with a
> GraphComputer.result() method. This method takes a Configuration which is
> used to construct a Graph via GraphFactory.open(configuration). Thus, for
> OLTP databases like Titan/Neo4j/etc., this would be a connection to the
> database. For TinkerGraph, the configuration would have the graph object
> in it (via setProperty()) or a completely "new TinkerGraph()." For HDFS
> graphs, the configuration would be the directory of the place to write the
> graph data (in essence, just a standard HadoopGraph properties file). When
> we get ServerGraph implemented, this would just be a GremlinServer
> connection. If no result() is provided, then ComputerResult.getGraph()
> would just return EmptyGraph.instance(). So, now this generalizes the
> ResultGraph.ORIGINAL/NEW situation, where a GraphComputer's resultant graph
> can write to any Graph -- i.e., vendor agnostic. For instance, Titan's
> FulgoraGraphComputer could, in principle, write its compute graph result
> out to Neo4j.
>
> Next, the BiFunction provided is how to write the computed vertex to the
> result graph. It would be great if this was all via Attachable, but then
> that assumes all vendors are operating on Attachable vertices, which isn't
> the case for TinkerGraph or Titan. This is where Stephen and I would need
> to think, but, in general, it's simply a "getOrCreate"-style method for
> taking the vertex and writing it to the destination. Like Attachable, we
> could have static methods for common use cases --
> XXX.writeVertexProperties(), XXX.writeVertexProperties(Map<String,String>
> propertyConverter), XXX.writeVertexPropertiesAndEdges(), etc. This way, it's
> up to the end user to determine how they want the results to be handled and
> again, it's vendor-agnostic (Graph -> Graph --- what those Graph instances
> are, who cares).
>
> Thoughts?
> Marko.
>
> http://markorodriguez.com
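
To make the chaining idea from the reply concrete, here is a minimal sketch in plain Java with no TinkerPop dependency. Every name in it (Vertex, Program, chain, inDegreeScore, bulkLoadInto, runExample) is hypothetical and invented for illustration; none of it is the actual GraphComputer or VertexProgram API. An in-degree count stands in for PageRank, and a map stands in for the target graph that a vendor-tuned BulkLoaderVertexProgram would write to.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChainedProgramsSketch {

    /** Hypothetical stand-in for a vertex: id, outgoing edges, computed properties. */
    static final class Vertex {
        final String id;
        final List<String> outEdges = new ArrayList<>();
        final Map<String, Double> properties = new HashMap<>();
        Vertex(String id) { this.id = id; }
    }

    /** Hypothetical stand-in for a VertexProgram; NOT the TinkerPop interface. */
    interface Program {
        void execute(Map<String, Vertex> graph);
    }

    /** Runs the given programs in order over the same compute graph. */
    static Program chain(Program... programs) {
        return graph -> {
            for (Program p : programs) {
                p.execute(graph);
            }
        };
    }

    /** Toy analytic standing in for PageRank: scores each vertex by in-degree. */
    static Program inDegreeScore() {
        return graph -> {
            for (Vertex v : graph.values()) {
                v.properties.put("score", 0.0);
            }
            for (Vertex v : graph.values()) {
                for (String targetId : v.outEdges) {
                    graph.get(targetId).properties.merge("score", 1.0, Double::sum);
                }
            }
        };
    }

    /** Toy loader standing in for a vendor-tuned BulkLoaderVertexProgram. */
    static Program bulkLoadInto(Map<String, Map<String, Double>> targetStore) {
        return graph -> {
            for (Vertex v : graph.values()) {
                // getOrCreate-style write: reuse the target record if it already exists.
                targetStore.computeIfAbsent(v.id, id -> new HashMap<>())
                           .putAll(v.properties);
            }
        };
    }

    /** Builds a three-vertex graph (a->b, a->c, c->b), then scores and loads it. */
    static Map<String, Map<String, Double>> runExample() {
        Map<String, Vertex> graph = new HashMap<>();
        for (String id : new String[] {"a", "b", "c"}) {
            graph.put(id, new Vertex(id));
        }
        graph.get("a").outEdges.add("b");
        graph.get("a").outEdges.add("c");
        graph.get("c").outEdges.add("b");

        Map<String, Map<String, Double>> targetStore = new HashMap<>();
        chain(inDegreeScore(), bulkLoadInto(targetStore)).execute(graph);
        return targetStore;
    }

    public static void main(String[] args) {
        System.out.println(runExample());
    }
}
```

The point of the sketch is the separation of concerns: the analytic program knows nothing about the target store, and the loader can be swapped per vendor without touching the compute engine. It does not attempt the optimization of folding the loader's single iteration into the last iteration of the previous program.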
