IMO it would be best if graph providers implemented a GraphOutputFormat for their graph implementation. That way we could rely on BulkDumperVertexProgram [1,2], which depends only on an InputFormat and an OutputFormat and can therefore be seen as a kind of [Copy|Clone]VertexProgram. If that's not an option, graph providers could still create their own VP that is optimized to handle transactions, id assignment, etc. properly for the underlying graph DB implementation.
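
To make the "copy program" idea concrete, here is a minimal conceptual sketch in plain Python. It is not TinkerPop code and uses none of its APIs: Vertex, GraphWriter, copy_graph and InMemoryWriter are all hypothetical names invented for illustration. The point is only that the generic part needs nothing beyond something to read vertices from and something to write them to, while transaction handling stays inside the provider's writer.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Protocol, Tuple


@dataclass
class Vertex:
    """A vertex together with its outgoing edges (hypothetical record type)."""
    id: int
    properties: Dict[str, object] = field(default_factory=dict)
    out_edges: List[Tuple[str, int]] = field(default_factory=list)  # (label, target id)


class GraphWriter(Protocol):
    """What a provider would supply; stands in for an output format."""
    def write(self, vertex: Vertex) -> None: ...
    def close(self) -> None: ...


def copy_graph(reader: Iterable[Vertex], writer: GraphWriter) -> int:
    """Generic copy: knows nothing about the source or the target store."""
    count = 0
    for vertex in reader:
        writer.write(vertex)
        count += 1
    writer.close()
    return count


class InMemoryWriter:
    """Hypothetical provider-side writer that commits in fixed-size batches."""

    def __init__(self, batch_size: int = 2) -> None:
        self.batch_size = batch_size
        self.pending: List[Vertex] = []
        self.committed: List[Vertex] = []
        self.commits = 0

    def write(self, vertex: Vertex) -> None:
        self.pending.append(vertex)
        if len(self.pending) >= self.batch_size:
            self._commit()

    def close(self) -> None:
        # Flush whatever is left from the last, possibly partial, batch.
        if self.pending:
            self._commit()

    def _commit(self) -> None:
        self.committed.extend(self.pending)
        self.pending.clear()
        self.commits += 1
```

Any iterable of Vertex objects can act as the reader here, so copy_graph(some_list, InMemoryWriter()) is enough to try it out. A real provider would replace InMemoryWriter with a writer that talks to its own storage layer and decides there how to batch, commit and assign ids.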

[1] http://tinkerpop.apache.org/docs/current/reference/#bulkdumpervertexprogram
[2] https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/computer/bulkdumping/BulkDumperVertexProgram.java

Cheers,
Daniel

On Thu, Jun 7, 2018 at 8:53 AM, Stephen Mallette <[email protected]> wrote:

> TinkerPop tries to generalize various aspects of graph computing and does a
> pretty good job of doing so, but every so often we try to generalize
> something and it just doesn't work the way we'd like. Indexing was one such
> casualty, if you need an example to consider, but I think that our attempt
> at bulk loading is falling into that area as well, specifically
> BulkLoaderVertexProgram (BLVP):
>
> http://tinkerpop.apache.org/docs/current/reference/#bulkloadervertexprogram
>
> What I'm seeing is that graph providers are offering their own bulk loading
> tools, which are inevitably faster and/or easier to use than BLVP. Here are
> some examples:
>
> CosmosDB: https://github.com/Microsoft/Microsoft.Azure.Graphs.BulkImport
> Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html
> Neo4j: https://neo4j.com/blog/bulk-data-import-neo4j-3-0/
> DSE Graph: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/dgl/dglOverview.html
> JanusGraph: https://docs.janusgraph.org/0.2.0/bulk-loading.html
>
> I suppose there are others, but hopefully those examples convey the point.
> Of those I mentioned, perhaps the JanusGraph one is a bit of a stretch, as
> its documentation references hadoop-gremlin, which I presume means BLVP.
> Maybe someone from JanusGraph can comment a bit further.
>
> In addition to graph providers having their own approaches to bulk loading,
> I tend to find that BLVP is always a question mark for users. They tend to
> have problems getting it working right, and we really haven't done much to
> improve its usability.
>
> So, given all that, would it be a bad idea to get TinkerPop out of the
> business of trying to generalize bulk loading? If we did, that would be one
> less feature to support, and we could arguably offer users a better
> experience by instructing them to use the bulk loader of their graph of
> choice. I suppose the downside of taking this stance would be that
> graph providers that don't provide bulk loaders couldn't rely on TinkerPop
> anymore for this need (JanusGraph? others?). Finally, users would not have
> a single general way to bulk load to any graph implementation. Perhaps
> there is a way to do that without BLVP in place?
