That's a nice idea. I think that gets us most of the way out of bulk loading while still providing a method for providers who want an "easy" way to offer a bulk loader. I sense that we wouldn't directly promote this feature to users and would instead leave it to graph providers to present that information in their own documentation. I think we'd just write up something in Provider Documentation so that folks are aware of how it works and what they need to do to take advantage of it.
On Thu, Jun 7, 2018 at 12:32 PM Daniel Kuppitz <m...@gremlin.guru> wrote:

> IMO it would be best if graph providers implemented a GraphOutputFormat
> for their graph implementation. This way we could rely on
> BulkDumperVertexProgram [1,2], which only relies on an InputFormat and an
> OutputFormat and thus can be seen as a kind of [Copy|Clone]VertexProgram.
> If that's not an option, then graph providers could still create their own
> VP that is optimized to handle transactions, id assignments, etc. properly
> in the underlying graph DB implementation.
>
> [1] http://tinkerpop.apache.org/docs/current/reference/#bulkdumpervertexprogram
> [2] https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/computer/bulkdumping/BulkDumperVertexProgram.java
>
> Cheers,
> Daniel
>
> On Thu, Jun 7, 2018 at 8:53 AM, Stephen Mallette <spmalle...@gmail.com>
> wrote:
>
> > TinkerPop tries to generalize various aspects of graph computing and does
> > a pretty good job of doing so, but every so often we try to generalize
> > something and it just doesn't work the way we'd like. Indexing was one
> > such casualty, if you need an example to consider, but I think that our
> > attempt at bulk loading is falling into that area as well, specifically
> > BulkLoaderVertexProgram (BLVP):
> >
> > http://tinkerpop.apache.org/docs/current/reference/#bulkloadervertexprogram
> >
> > What I'm seeing is that graph providers are offering their own bulk
> > loading tools, which are inevitably faster and/or easier to use than
> > BLVP. Here are some examples:
> >
> > CosmosDB: https://github.com/Microsoft/Microsoft.Azure.Graphs.BulkImport
> > Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html
> > Neo4j: https://neo4j.com/blog/bulk-data-import-neo4j-3-0/
> > DSE Graph: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/dgl/dglOverview.html
> > JanusGraph: https://docs.janusgraph.org/0.2.0/bulk-loading.html
> >
> > I suppose there are others, but hopefully those examples convey the
> > point. Of those I mentioned, perhaps the JanusGraph one is a bit of a
> > stretch, as its documentation references hadoop-gremlin, which I presume
> > means BLVP. Maybe someone on JanusGraph can comment a bit further.
> >
> > In addition to graph providers having their own approaches to bulk
> > loading, I tend to find that BLVP is always a question mark for users.
> > They tend to have problems getting it working right, and we really
> > haven't done much to improve its usability.
> >
> > So, given all that, would it be a bad idea to get TinkerPop out of the
> > business of trying to generalize bulk loading? If we did, that would be
> > one less feature to support, and we could arguably give users a better
> > experience by directing them to the bulk loader of their graph of
> > choice. I suppose that the downside to taking this stance would be that
> > graph providers that don't provide bulk loaders couldn't rely on
> > TinkerPop anymore for this need (JanusGraph? others?). Finally, users
> > would not have a single general way to bulk load to any graph
> > implementation. Perhaps there is a way to do that without BLVP in place?