Done done. Functionality again validated by several manual tests. Cheers, Daniel
On Mon, Aug 31, 2015 at 9:36 PM, Marko Rodriguez <[email protected]> wrote:

> Thanks.
>
> Can you have the builder's fluent methods be namespaced to
> bulkLoaderVertexProgram, please:
>
> gremlin.bulkLoaderVertexProgram.vertexIdProperty=
> gremlin.bulkLoaderVertexProgram.keepOriginalIds=
>
> Next, I would make your writeGraph part prefixed accordingly. You simply
> do "graph". Is that sufficient? HadoopGraph and TitanFactory are going to
> be vying for that namespace. Perhaps, given that you are nicely loading
> this from a preexisting configuration file, just use the
> gremlin.bulkLoaderVertexProgram namespace again:
>
> gremlin.bulkLoaderVertexProgram.writeGraph.graph
> gremlin.bulkLoaderVertexProgram.writeGraph.storage.backend
> gremlin.bulkLoaderVertexProgram.writeGraph.storage.hostname
> …
> etc.
>
> Thanks,
> Marko.
>
> http://markorodriguez.com
>
> On Aug 31, 2015, at 1:11 PM, Daniel Kuppitz <[email protected]> wrote:
>
> > Under the hood the last example generates this configuration:
> >
> > *# from hadoop-script.properties:*
> > gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> > gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> > gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
> > gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> > gremlin.hadoop.jarsInDistributedCache=true
> > gremlin.hadoop.inputLocation=grateful-dead.txt
> > gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> > gremlin.hadoop.outputLocation=output
> > spark.master=local[4]
> > spark.executor.memory=1g
> > spark.serializer=org.apache.spark.serializer.KryoSerializer
> >
> > *# from the builder's fluent methods:*
> > loader.vertexIdProperty="bulkloader.vertex.id"
> > loader.keepOriginalIds=false
> > loader.intermediateBatchSize=10000
> >
> > *# from the writeGraph method:*
> > graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> > graph.storage.backend=cassandra
> > graph.storage.hostname=127.0.0.1
> > graph.storage.batch-loading=true
> >
> > What BulkLoaderVertexProgram ultimately gets from the Builder is:
> >
> > loader.vertexIdProperty="bulkloader.vertex.id"
> > loader.keepOriginalIds=false
> > loader.intermediateBatchSize=10000
> > graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> > graph.storage.backend=cassandra
> > graph.storage.hostname=127.0.0.1
> > graph.storage.batch-loading=true
> >
> > The loader subset is then passed to the BulkLoader implementation; the
> > graph subset is passed to GraphFactory.open().
> >
> > Cheers,
> > Daniel
> >
> > On Mon, Aug 31, 2015 at 8:55 PM, Marko Rodriguez <[email protected]> wrote:
> >
> >> Hi Daniel,
> >>
> >> That looks really good. Just for closure, can you show me what your
> >> properties file looks like after it gets "unrolled" by BLVP? Even if you
> >> fat finger in a few lines for example.
> >>
> >> Thanks,
> >> Marko.
> >>
> >> http://markorodriguez.com
> >>
> >> On Aug 31, 2015, at 12:39 PM, Daniel Kuppitz <[email protected]> wrote:
> >>
> >>> Okay, most of the confusion came from how I implemented the
> >>> configuration stuff.
> >>> I tweaked it a bit and here's what we have now compared to my
> >>> initial post:
> >>>
> >>> *hadoop-script.properties* got a lot slimmer:
> >>>
> >>> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> >>>
> >>> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> >>> gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
> >>> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> >>> gremlin.hadoop.jarsInDistributedCache=true
> >>> gremlin.hadoop.inputLocation=grateful-dead.txt
> >>> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> >>> gremlin.hadoop.outputLocation=output
> >>>
> >>> spark.master=local[4]
> >>> spark.executor.memory=1g
> >>> spark.serializer=org.apache.spark.serializer.KryoSerializer
> >>>
> >>> The builder now provides fluent methods to configure the
> >>> BulkLoaderVertexProgram:
> >>>
> >>> blgr = GraphFactory.open("hadoop-script.properties")
> >>> blvp = BulkLoaderVertexProgram.build().
> >>>            vertexIdProperty("bulkloader.vertex.id").
> >>>            keepOriginalIds(false).
> >>>            writeGraph("titan-cassandra-bulk.properties").
> >>>            intermediateBatchSize(10000).create(blgr)
> >>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>
> >>> *titan-cassandra-bulk.properties* looks just like any other Graph
> >>> configuration file you're already familiar with:
> >>>
> >>> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> >>>
> >>> storage.backend=cassandra
> >>> storage.hostname=127.0.0.1
> >>> storage.batch-loading=true
> >>>
> >>> And one last note regarding the question *"why do we need 2
> >>> configurations? gremlin.hadoop.graphOutputFormat and
> >>> gremlin.bulkLoaderVertexProgram.graph.class"*:
> >>>
> >>> We don't.
graphOutputFormat can't be used by the BLVP, since we don't only
> >>> write, but also read elements from the target graph. Since a
> >>> VertexProgram doesn't know anything about the OutputFormat, we
> >>> wouldn't be able to get an instance of that graph and thus we
> >>> wouldn't be able to read from it. Consequently the graphOutputFormat
> >>> can/should always be NullOutputFormat for BLVP (any other output
> >>> format would have the same effect); instead, the writeGraph
> >>> configuration (as shown in my last sample) always has to be provided.
> >>>
> >>> Cheers,
> >>> Daniel
> >>>
> >>> On Mon, Aug 31, 2015 at 4:34 PM, Marko Rodriguez <[email protected]> wrote:
> >>>
> >>>> Hi Daniel,
> >>>>
> >>>> This is great that we now have bulk loading via TP3 GraphComputer.
> >>>> However, before this gets merged, it is important that you get your
> >>>> configuration model consistent with the pattern we are using with
> >>>> other VertexPrograms. Problems:
> >>>>
> >>>> 1. gremlin.bulkLoaderVertexProgram.graph.storage.backend
> >>>>        - What does this have to do with Gremlin? This is a Titan
> >>>>          thing. We can not mix this. Titan is not TinkerPop.
> >>>>        - Titan needs its own namespace. Talk to the Titan guys and
> >>>>          see what they are using for namespace conventions too, as
> >>>>          if you are committing to Titan's code base, you should
> >>>>          follow their pattern (and not make up your own).
> >>>> 2. gremlin.hadoop.graphOutputFormat
> >>>>        - This is where you specify your output, not in two places
> >>>>          e.g. -- ? gremlin.bulkLoaderVertexProgram.graph.class. ?
> >>>> 3. Please look at how you are doing your fluent building of
> >>>>    BulkLoaderVertexProgram. Do not use Object[] key values.
> >>>>        - Study the pre-existing vertex programs and follow their
> >>>>          pattern.
> >>>>
> >>>> In general, do your best to follow existing patterns. If everyone
> >>>> has different naming conventions, fluent APIs, etc.,
TP3 will feel disjoint. Study what has already been created
> >>>> (configurations, fluent APIs, etc.) and use that model.
> >>>>
> >>>> Thanks Daniel,
> >>>> Marko.
> >>>>
> >>>> http://markorodriguez.com
> >>>>
> >>>> On Aug 31, 2015, at 8:06 AM, Daniel Kuppitz <[email protected]> wrote:
> >>>>
> >>>>> Hello TinkerPop devs,
> >>>>>
> >>>>> over the last couple of days we've implemented a BulkLoaderVertexProgram
> >>>>> for TinkerPop3. I know a lot of people are waiting for it and I guess -
> >>>>> once it's released - it won't take very long until Stephen and I continue
> >>>>> the Powers of Ten
> >>>>> <http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/> blog post
> >>>>> series.
> >>>>>
> >>>>> The TinkerPop3 BulkLoaderVertexProgram comes with an IncrementalBulkLoader
> >>>>> implementation that is used by default. However, it's easy to use your own
> >>>>> customized implementation of a bulk loader. The vertex program supports all
> >>>>> the input formats you're already familiar with (GraphSON, Kryo, Script). As
> >>>>> a target graph you can use any graph that supports multiple concurrent
> >>>>> connections (unfortunately that restriction disqualifies Neo4j, as its
> >>>>> current TP3 implementation does not support the HA mode). Let me walk you
> >>>>> through a simple example that loads the Grateful Dead graph into Titan.
> >>>>>
> >>>>> *Prerequisites*
> >>>>>
> >>>>> - TinkerPop3 (development branch: blvp)
> >>>>> - Titan 0.9 (customized build)
> >>>>> - a running Hadoop (pseudo) cluster
> >>>>> - Cassandra 2.1.x (for this particular example, as I'm going to use
> >>>>>   Titan/Cassandra)
> >>>>>
> >>>>> *Build TinkerPop3 from source*
> >>>>>
> >>>>> git clone https://github.com/apache/incubator-tinkerpop.git
> >>>>> cd incubator-tinkerpop
> >>>>> git checkout blvp
> >>>>> mvn clean install -DskipTests
> >>>>>
> >>>>> *Build Titan from source*
> >>>>>
> >>>>> git clone https://github.com/thinkaurelius/titan.git
> >>>>> cd titan
> >>>>> sed 's@<tinkerpop.version>.*</tinkerpop.version>@<tinkerpop.version>3.0.1-SNAPSHOT</tinkerpop.version>@' pom.xml > pom.xml.new
> >>>>> mv pom.xml.new pom.xml
> >>>>> mvn clean install -DskipTests
> >>>>>
> >>>>> *Copy the Grateful Dead files to HDFS*
> >>>>>
> >>>>> cd incubator-tinkerpop
> >>>>> find . -name script-input-grateful-dead.groovy | head -n1 | xargs -I {}
> >>>>>   hadoop fs -copyFromLocal {} script-input-grateful-dead.groovy
> >>>>> find .
  -name grateful-dead.txt | head -n1 | xargs -I {}
> >>>>>   hadoop fs -copyFromLocal {} grateful-dead.txt
> >>>>>
> >>>>> *Create 2 configuration files - one for Titan/Cassandra, one for
> >>>>> Hadoop / for the BulkLoader*
> >>>>>
> >>>>> *titan-cassandra.properties*
> >>>>>
> >>>>> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> >>>>>
> >>>>> storage.backend=cassandrathrift
> >>>>> storage.hostname=127.0.0.1
> >>>>>
> >>>>> *hadoop-script.properties*
> >>>>>
> >>>>> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> >>>>>
> >>>>> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> >>>>> gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
> >>>>> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> >>>>> gremlin.hadoop.jarsInDistributedCache=true
> >>>>> gremlin.hadoop.inputLocation=grateful-dead.txt
> >>>>> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> >>>>> gremlin.hadoop.outputLocation=output
> >>>>>
> >>>>> # Bulk Loader configuration
> >>>>> gremlin.bulkLoaderVertexProgram.loader.class=org.apache.tinkerpop.gremlin.process.computer.bulkloading.IncrementalBulkLoader
> >>>>> gremlin.bulkLoaderVertexProgram.loader.vertexIdProperty=bulkloader.vertex.id
> >>>>> gremlin.bulkLoaderVertexProgram.loader.userSuppliedIds=false
> >>>>> gremlin.bulkLoaderVertexProgram.loader.keepOriginalIds=false
> >>>>> gremlin.bulkLoaderVertexProgram.graph.class=com.thinkaurelius.titan.core.TitanFactory
> >>>>> gremlin.bulkLoaderVertexProgram.graph.storage.backend=cassandrathrift
> >>>>> gremlin.bulkLoaderVertexProgram.graph.storage.hostname=127.0.0.1
> >>>>> gremlin.bulkLoaderVertexProgram.graph.storage.batch-loading=true
> >>>>> gremlin.bulkLoaderVertexProgram.intermediateBatchSize=10000
> >>>>>
> >>>>> spark.master=local[4]
> >>>>> spark.executor.memory=1g
> >>>>> spark.serializer=org.apache.spark.serializer.KryoSerializer
> >>>>>
> >>>>> *Create the Titan schema*
> >>>>>
> >>>>> cd titan
> >>>>> bin/gremlin.sh
> >>>>>
> >>>>> graph = GraphFactory.open("titan-cassandra.properties")
> >>>>> m = graph.openManagement()
> >>>>> // vertex labels
> >>>>> artist = m.makeVertexLabel("artist").make()
> >>>>> song = m.makeVertexLabel("song").make()
> >>>>> // edge labels
> >>>>> sungBy = m.makeEdgeLabel("sungBy").make()
> >>>>> writtenBy = m.makeEdgeLabel("writtenBy").make()
> >>>>> followedBy = m.makeEdgeLabel("followedBy").make()
> >>>>> // vertex and edge properties
> >>>>> blid = m.makePropertyKey("bulkloader.vertex.id").dataType(Long.class).make()
> >>>>> name = m.makePropertyKey("name").dataType(String.class).make()
> >>>>> songType = m.makePropertyKey("songType").dataType(String.class).make()
> >>>>> performances = m.makePropertyKey("performances").dataType(Integer.class).make()
> >>>>> weight = m.makePropertyKey("weight").dataType(Integer.class).make()
> >>>>> // global indices
> >>>>> m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
> >>>>> m.buildIndex("artistsByName", Vertex.class).addKey(name).indexOnly(artist).buildCompositeIndex()
> >>>>> m.buildIndex("songsByName", Vertex.class).addKey(name).indexOnly(song).buildCompositeIndex()
> >>>>> // vertex centric indices
> >>>>> m.buildEdgeIndex(followedBy, "followedByTime", Direction.BOTH, Order.decr, weight)
> >>>>> m.commit()
> >>>>> graph.close()
> >>>>>
> >>>>> Up to this point it's the usual stuff that we do all day long: create
> >>>>> configurations, create schemas, mess with Hadoop... All in all nothing
> >>>>> special.
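[Editor's note: the byBulkLoaderVertexId index above exists so the incremental loader can look up previously written vertices by the source id stored in the bulkloader.vertex.id property. A minimal sketch of that get-or-create idea in plain Java, with a Map standing in for the graph index — the names here are illustrative, not BLVP's actual API:]

```java
import java.util.HashMap;
import java.util.Map;

public class GetOrCreateSketch {
    // Stand-in for the "byBulkLoaderVertexId" index: source vertex id -> target vertex id.
    static final Map<Long, String> index = new HashMap<>();
    static long nextId = 0;

    // Return the already-loaded target vertex for a source id, or create a new one.
    // Running the load twice therefore does not duplicate vertices.
    static String getOrCreateVertex(long sourceId) {
        return index.computeIfAbsent(sourceId, id -> "v" + (nextId++));
    }

    public static void main(String[] args) {
        String first = getOrCreateVertex(42L);   // creates a new vertex
        String second = getOrCreateVertex(42L);  // finds the same vertex again
        System.out.println(first.equals(second)); // true: the load is idempotent
    }
}
```

This is why the index must exist before the bulk load starts: without it, every lookup by bulkloader.vertex.id would be a full scan.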
> >>>>> *Here comes the new part - start the BulkLoaderVertexProgram*
> >>>>>
> >>>>> blgr = GraphFactory.open("hadoop-script.properties")
> >>>>> blvp = BulkLoaderVertexProgram.build().create(blgr)
> >>>>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>>>
> >>>>> Note that you don't have to have the Bulk Loader configuration embedded
> >>>>> in your Hadoop graph configuration file; you can also do:
> >>>>>
> >>>>> blgr = GraphFactory.open("hadoop-script.properties")
> >>>>> blvp = BulkLoaderVertexProgram.build().configure(
> >>>>>            // default values not included
> >>>>>            "loader.vertexIdProperty", "bulkloader.vertex.id",
> >>>>>            "loader.keepOriginalIds", false,
> >>>>>            "graph.class", "com.thinkaurelius.titan.core.TitanFactory",
> >>>>>            "graph.storage.backend", "cassandrathrift",
> >>>>>            "graph.storage.hostname", "127.0.0.1",
> >>>>>            "graph.storage.batch-loading", true,
> >>>>>            "intermediateBatchSize", 10000
> >>>>>        ).create(blgr)
> >>>>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>>>
> >>>>> ...or simply mix both approaches.
> >>>>>
> >>>>> Play around with it and let us know what you think. If we get enough
> >>>>> positive feedback / no negative feedback, the BulkLoaderVertexProgram
> >>>>> will make it into the next TinkerPop release (3.0.1) and thus also into
> >>>>> the next Titan release.
> >>>>>
> >>>>> Cheers,
> >>>>> Daniel
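[Editor's note: the configuration unrolling agreed on earlier in the thread — the loader.* subset goes to the BulkLoader implementation, the graph.* subset to GraphFactory.open() — boils down to splitting a flat key/value configuration by prefix. A minimal sketch in plain Java, using java.util.Properties as a stand-in for Commons Configuration's subset(); this is not the actual BLVP code:]

```java
import java.util.Properties;

public class ConfigSplitSketch {
    // Return the entries of `props` whose keys start with `prefix + "."`,
    // with the prefix stripped off (analogous to Commons Configuration's subset()).
    static Properties subset(Properties props, String prefix) {
        Properties result = new Properties();
        String p = prefix + ".";
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(p)) {
                result.setProperty(key.substring(p.length()), props.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The "unrolled" configuration from the thread, as flat key/value pairs.
        Properties cfg = new Properties();
        cfg.setProperty("loader.keepOriginalIds", "false");
        cfg.setProperty("graph.gremlin.graph", "com.thinkaurelius.titan.core.TitanFactory");
        cfg.setProperty("graph.storage.backend", "cassandra");

        // The graph.* subset is what would be handed to GraphFactory.open(...),
        // the loader.* subset to the BulkLoader implementation.
        Properties graphCfg = subset(cfg, "graph");
        System.out.println(graphCfg.getProperty("storage.backend")); // cassandra
    }
}
```

The prefix convention is what lets one properties file carry Hadoop, Spark, loader, and target-graph settings side by side without collisions.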
