Matt, I hadn't thought of the streaming use case for read/write. IO has been more about serialization of graphs and their related elements, and in some cases serialization of arbitrary objects (as needed by Gremlin Server). Not sure if the streaming use case is a TinkerPop responsibility or not... more thinking required, I guess.
As it stands, readGraph() is really not meant for incremental loading (it doesn't expect mutations to be occurring beyond what it is doing itself). I suppose it could/should be adapted as such with some more code - perhaps that is something for post-GA.

As for writeGraph() and simultaneous operations, the write occurs over a simple iteration of all vertices, so I would suspect that if there were other transactional contexts at play, such changes would be visible, depending on how the Graph implementation handles such things. The read/writeGraph() feature is meant for small graphs right now. That's why Marko brought up this thread, as we need better and improved methods for dealing with more complex loads.

On Thu, Apr 30, 2015 at 12:50 PM, Matt Frantz <[email protected]> wrote:

> The questions that occur to me are somewhat broad, so I apologize if they
> distract from the intended topic. However, I do feel they are related to a
> proper IO design.
>
> Would the readGraph API be suitable for a continuously streaming loader,
> e.g. to parse an activity stream, or is it only used for finite inputs?
>
> Would the writeGraph API be suitable for a continuously streaming
> extractor, e.g. to write an external transaction log, or to synchronize a
> replica, or is it only used for finite outputs?
>
> What is the expected behavior when there is simultaneous access, e.g.
> queries occurring during readGraph, or mutations occurring during
> writeGraph?
>
> On Thu, Apr 30, 2015 at 9:01 AM, Marko Rodriguez <[email protected]>
> wrote:
>
> > Hi,
> >
> > Stephen is interested in making sure that Graph.io() works cleanly for
> > both OLTP and OLAP. In particular, making sure that io().readGraph() and
> > io().writeGraph() can be used in both OLTP and OLAP situations seamlessly,
> > much like Gremlin does for traversals.
> >
> > ------------
> >
> > OLAP graph writing will occur via a (yet to be written)
> > BulkLoaderVertexProgram. BulkLoaderVertexProgram takes a Graph (with
> > vertices/edges) and writes to another Graph. In essence, two graphs, where
> > the first graph has the data and the second is empty. I always expected
> > this to typically happen via Hadoop (HadoopGraph) -> VendorDatabase
> > (VendorGraph). However, while most distributed graph database vendors
> > will leverage Hadoop/Giraph/Spark for their OLAP bulk loading operations
> > because of HDFS, we can't always assume this -- especially in the context
> > of OLAP Graph.io().
> >
> > Thus, BulkLoaderVertexProgram shouldn't just operate on Graph->Graph, but
> > can optionally stream in a file as well, File->Graph. This means we have
> > to get into the concept of "InputSplits" at the gremlin-core level. A
> > quick-and-dirty approach is to simply load the graph data serially from a
> > file; this is not the optimal solution, but it can move us forward on the
> > Graph.io() API.
> >
> > On to the API of Graph.io(). This would mean that, like Traversal, the
> > user can specify a Computer to use to do the readGraph():
> >
> > graph.io().readGraph(file, graph.compute(MyGraphComputer.class))
> >
> > For writeGraph():
> >
> > graph.io().writeGraph(file, graph.compute(MyGraphComputer.class))
> >
> > Here, "file" can be a directory in both situations, and each "worker" of
> > the GraphComputer reads/writes a split.
> >
> > Thoughts?
> > Marko.
> >
> > http://markorodriguez.com
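For readers following along: the writeGraph() behavior described above (a simple pass over all vertices, with no snapshot isolation, so concurrent visibility is left entirely to the Graph implementation) can be modeled as a toy sketch. This is not TinkerPop internals; `ToyGraph`, its line format, and the method names are all made up for illustration.

```java
import java.util.*;
import java.util.function.Consumer;

// Toy model of the "write is just an iteration over all vertices" behavior.
// There is no snapshot: whatever the backing map's iterator exposes at the
// moment of iteration is what gets written out.
public class ToyGraph {
    private final Map<String, String> vertices = new LinkedHashMap<>(); // id -> label

    public void addVertex(String id, String label) {
        vertices.put(id, label);
    }

    // "writeGraph": emit one record per vertex, in iteration order.
    public void writeGraph(Consumer<String> out) {
        for (Map.Entry<String, String> v : vertices.entrySet()) {
            out.accept(v.getKey() + ":" + v.getValue());
        }
    }
}
```

In a real store, the iterator's isolation guarantees (or lack thereof) are what determine whether simultaneous mutations show up in the output, which is exactly the "dependent on how the Graph implementation handles such things" caveat above.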
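The split-per-worker idea in Marko's proposal ("file" can be a directory and each GraphComputer worker reads one split) can also be sketched outside of TinkerPop. The sketch below is not TinkerPop code: `SplitReaderSketch`, the one-vertex-id-per-line split format, and the fixed-size thread pool standing in for GraphComputer workers are all assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: treat each file in the directory as one input split
// and hand it to a worker; the "graph" here is just the set of vertex ids.
public class SplitReaderSketch {

    public static Set<String> readGraph(Path directory) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        Set<String> graph = ConcurrentHashMap.newKeySet();
        List<Future<?>> tasks = new ArrayList<>();
        try (DirectoryStream<Path> splits = Files.newDirectoryStream(directory)) {
            for (Path split : splits) {
                // one worker per split; each loads its split independently
                tasks.add(workers.submit(() -> {
                    try {
                        graph.addAll(Files.readAllLines(split));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }));
            }
        }
        for (Future<?> t : tasks) t.get(); // propagate any worker failure
        workers.shutdown();
        return graph;
    }
}
```

The serial quick-and-dirty version mentioned in the thread is the degenerate case of this: one worker reading every split in sequence.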
