Re: [DISCUSS] Bulk Loading

Daniel Kuppitz Thu, 07 Jun 2018 09:32:45 -0700

IMO it would be best if graph providers implemented a GraphOutputFormat
for their graph implementation. That way we could rely on
BulkDumperVertexProgram [1,2], which depends only on an InputFormat and an
OutputFormat and can thus be seen as a kind of [Copy|Clone]VertexProgram.
If that's not an option, graph providers could still create their own
VP that is optimized to handle transactions, id assignment, etc. properly
in the underlying graph DB implementation.
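
To make the "copy program" idea concrete, here is a rough toy sketch in
plain Java. It is not TinkerPop or Hadoop code: CopySketch, VertexSource,
VertexSink, copy and demo are hypothetical names invented for illustration,
standing in for an InputFormat, a provider-supplied GraphOutputFormat and
the vertex program. The only point is that the copying logic never needs to
know which graph sits on either side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Toy model only: none of these types are TinkerPop or Hadoop classes.
class CopySketch {

    // Hypothetical stand-in for an InputFormat: yields the source graph's records.
    interface VertexSource extends Supplier<Iterable<String>> {}

    // Hypothetical stand-in for a provider-supplied GraphOutputFormat:
    // accepts one record at a time and writes it to the target graph.
    interface VertexSink extends Consumer<String> {}

    // The "program" is graph-agnostic: it streams every record from the
    // source to the sink and returns how many records it copied.
    static int copy(VertexSource source, VertexSink sink) {
        int count = 0;
        for (String record : source.get()) {
            sink.accept(record);
            count++;
        }
        return count;
    }

    // Small demo: an in-memory list plays the role of the target graph.
    static List<String> demo() {
        List<String> target = new ArrayList<>();
        VertexSource source = () -> Arrays.asList("v1", "v2", "v3");
        copy(source, target::add);
        return target;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [v1, v2, v3]
    }
}
```

In this picture a provider would only have to supply the sink side. A
provider-specific VP would instead replace copy() itself, which is where
transaction and id handling could be tuned.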


[1]
http://tinkerpop.apache.org/docs/current/reference/#bulkdumpervertexprogram
[2]
https://github.com/apache/tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/computer/bulkdumping/BulkDumperVertexProgram.java

Cheers,
Daniel


On Thu, Jun 7, 2018 at 8:53 AM, Stephen Mallette wrote:

> TinkerPop tries to generalize various aspects of graph computing and does a
> pretty good job of doing so, but every so often we try to generalize
> something and it just doesn't work the way we'd like. Indexing was one such
> casualty, if you need an example to consider, but I think that our attempt
> at bulk loading is falling into that area as well, specifically:
> BulkLoaderVertexProgram (BLVP):
>
> http://tinkerpop.apache.org/docs/current/reference/#bulkloadervertexprogram
>
> What I'm seeing is that graph providers are offering their own bulk loading
> tools which are inevitably faster and/or easier to use than BLVP. Here are
> some examples:
>
> CosmosDB: https://github.com/Microsoft/Microsoft.Azure.Graphs.BulkImport
> Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load.html
> Neo4j: https://neo4j.com/blog/bulk-data-import-neo4j-3-0/
> DSE Graph: https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/dgl/dglOverview.html
> JanusGraph: https://docs.janusgraph.org/0.2.0/bulk-loading.html
>
> I suppose there are others, but hopefully those examples convey the point.
> Of those I mentioned, perhaps the JanusGraph one is a bit of a stretch, as
> its documentation references hadoop-gremlin, which I presume means BLVP.
> Maybe someone from JanusGraph can comment a bit further.
>
> In addition to graph providers having their own approaches to bulk loading,
> I tend to find that BLVP is always a question mark for users. They tend to
> have problems getting it working right and we really haven't done much to
> improve its usage.
>
> So, given all that, would it be a bad idea to get TinkerPop out of the
> business of trying to generalize bulk loading? If we did, that would be one
> less feature to support and we could arguably recommend to users a better
> experience by instructing them to use the bulk loader of their graph of
> choice. I suppose that the downside to taking this stance would be that
> graph providers that don't provide bulk loaders couldn't rely on TinkerPop
> anymore for this need (JanusGraph? others?). Finally, users would not have
> a single general way to bulk load to any graph implementation. Perhaps
> there is a way to do that without BLVP in place?
>
