Under the hood, the last example generates this configuration:

*# from hadoop-script.properties:*
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=grateful-dead.txt
gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
gremlin.hadoop.outputLocation=output
spark.master=local[4]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
*# from the builder's fluent methods:*
loader.vertexIdProperty="bulkloader.vertex.id"
loader.keepOriginalIds=false
loader.intermediateBatchSize=10000

*# from the writeGraph method:*
graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
graph.storage.backend=cassandra
graph.storage.hostname=127.0.0.1
graph.storage.batch-loading=true

What BulkLoaderVertexProgram ultimately gets from the builder is:

loader.vertexIdProperty="bulkloader.vertex.id"
loader.keepOriginalIds=false
loader.intermediateBatchSize=10000
graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
graph.storage.backend=cassandra
graph.storage.hostname=127.0.0.1
graph.storage.batch-loading=true

The loader subset is then passed to the BulkLoader implementation; the
graph subset is passed to GraphFactory.open().

Cheers,
Daniel

On Mon, Aug 31, 2015 at 8:55 PM, Marko Rodriguez <[email protected]> wrote:

> Hi Daniel,
>
> That looks really good. Just for closure, can you show me what your
> properties file looks like after it gets "unrolled" by BLVP? Even if you
> fat-finger in a few lines for example.
>
> Thanks,
> Marko.
>
> http://markorodriguez.com
>
> On Aug 31, 2015, at 12:39 PM, Daniel Kuppitz <[email protected]> wrote:
>
> > Okay, most of the confusion came from how I implemented the
> > configuration stuff.
> > I tweaked it a bit and here's what we have now compared to my initial
> > post:
> >
> > *hadoop-script.properties* got a lot slimmer:
> >
> > gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> > gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> > gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
> > gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> > gremlin.hadoop.jarsInDistributedCache=true
> > gremlin.hadoop.inputLocation=grateful-dead.txt
> > gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> > gremlin.hadoop.outputLocation=output
> >
> > spark.master=local[4]
> > spark.executor.memory=1g
> > spark.serializer=org.apache.spark.serializer.KryoSerializer
> >
> > The builder now provides fluent methods to configure the
> > BulkLoaderVertexProgram:
> >
> > blgr = GraphFactory.open("hadoop-script.properties")
> > blvp = BulkLoaderVertexProgram.build().
> >     vertexIdProperty("bulkloader.vertex.id").
> >     keepOriginalIds(false).
> >     writeGraph("titan-cassandra-bulk.properties").
> >     intermediateBatchSize(10000).create(blgr)
> > blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >
> > *titan-cassandra-bulk.properties* looks just like any other Graph
> > configuration file you're already familiar with:
> >
> > gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> >
> > storage.backend=cassandra
> > storage.hostname=127.0.0.1
> > storage.batch-loading=true
> >
> > And one last note regarding the question *"why do we need 2
> > configurations? gremlin.hadoop.graphOutputFormat and
> > gremlin.bulkLoaderVertexProgram.graph.class"*:
> >
> > We don't. graphOutputFormat can't be used by the BLVP, since we don't
> > only write, but also read elements from the target graph.
> > Since a VertexProgram doesn't know anything about the OutputFormat, we
> > wouldn't be able to get an instance of that graph and thus we wouldn't
> > be able to read from it. Consequently, the graphOutputFormat can/should
> > always be NullOutputFormat for BLVP (any other output format would have
> > the same effect); instead, the writeGraph configuration (as shown in my
> > last sample) always has to be provided.
> >
> > Cheers,
> > Daniel
> >
> > On Mon, Aug 31, 2015 at 4:34 PM, Marko Rodriguez <[email protected]> wrote:
> >
> >> Hi Daniel,
> >>
> >> This is great that we now have bulk loading via TP3 GraphComputer.
> >> However, before this gets merged, it is important that you get your
> >> configuration model consistent with the pattern we are using with other
> >> VertexPrograms. Problems:
> >>
> >> 1. gremlin.bulkLoaderVertexProgram.graph.storage.backend
> >>    - What does this have to do with Gremlin? This is a Titan thing.
> >>      We cannot mix this. Titan is not TinkerPop.
> >>    - Titan needs its own namespace. Talk to the Titan guys and see
> >>      what they are using for namespace conventions too, as if you are
> >>      committing to Titan's code base, you should follow their pattern
> >>      (and not make up your own).
> >> 2. gremlin.hadoop.graphOutputFormat
> >>    - This is where you specify your output, not in two places,
> >>      e.g. gremlin.bulkLoaderVertexProgram.graph.class.
> >> 3. Please look at how you are doing your fluent building of
> >>    BulkLoaderVertexProgram. Do not use Object[] key values.
> >>    - Study the pre-existing vertex programs and follow their pattern.
> >>
> >> In general, do your best to follow existing patterns. If everyone has
> >> different naming conventions, fluent APIs, etc., that will make TP3
> >> feel disjoint. Study what has already been created (configurations,
> >> fluent APIs, etc.) and use that model.
> >>
> >> Thanks Daniel,
> >> Marko.
> >>
> >> http://markorodriguez.com
> >>
> >> On Aug 31, 2015, at 8:06 AM, Daniel Kuppitz <[email protected]> wrote:
> >>
> >>> Hello TinkerPop devs,
> >>>
> >>> Over the last couple of days we've implemented a
> >>> BulkLoaderVertexProgram for TinkerPop3. I know a lot of people are
> >>> waiting for it, and I guess - once it's released - it won't take very
> >>> long until Stephen and I continue the Powers of Ten
> >>> <http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/> blog post
> >>> series.
> >>>
> >>> The TinkerPop3 BulkLoaderVertexProgram comes with an
> >>> IncrementalBulkLoader implementation that is used by default. However,
> >>> it's easy to use your own customized implementation of a bulk loader.
> >>> The vertex program supports all the input formats you're already
> >>> familiar with (GraphSON, Kryo, Script). As a target graph you can use
> >>> any graph that supports multiple concurrent connections (unfortunately
> >>> that restriction disqualifies Neo4j, as its current TP3 implementation
> >>> does not support the HA mode). Let me walk you through a simple
> >>> example that loads the Grateful Dead graph into Titan.
> >>>
> >>> *Prerequisites*
> >>>
> >>> - TinkerPop3 (development branch: blvp)
> >>> - Titan 0.9 (customized build)
> >>> - a running Hadoop (pseudo) cluster
> >>> - Cassandra 2.1.x (for this particular example, as I'm going to use
> >>>   Titan/Cassandra)
> >>>
> >>> *Build TinkerPop3 from source*
> >>>
> >>> git clone https://github.com/apache/incubator-tinkerpop.git
> >>> cd incubator-tinkerpop
> >>> git checkout blvp
> >>> mvn clean install -DskipTests
> >>>
> >>> *Build Titan from source*
> >>>
> >>> git clone https://github.com/thinkaurelius/titan.git
> >>> cd titan
> >>> sed 's@<tinkerpop.version>.*</tinkerpop.version>@<tinkerpop.version>3.0.1-SNAPSHOT</tinkerpop.version>@' pom.xml > pom.xml.new
> >>> mv pom.xml.new pom.xml
> >>> mvn clean install -DskipTests
> >>>
> >>> *Copy the Grateful Dead files to HDFS*
> >>>
> >>> cd incubator-tinkerpop
> >>> find . -name script-input-grateful-dead.groovy | head -n1 | xargs -I {} hadoop fs -copyFromLocal {} script-input-grateful-dead.groovy
> >>> find . -name grateful-dead.txt | head -n1 | xargs -I {} hadoop fs -copyFromLocal {} grateful-dead.txt
> >>>
> >>> *Create 2 configuration files - one for Titan/Cassandra, one for
> >>> Hadoop / the BulkLoader*
> >>>
> >>> *titan-cassandra.properties*
> >>>
> >>> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> >>>
> >>> storage.backend=cassandrathrift
> >>> storage.hostname=127.0.0.1
> >>>
> >>> *hadoop-script.properties*
> >>>
> >>> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> >>> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> >>> gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
> >>> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> >>> gremlin.hadoop.jarsInDistributedCache=true
> >>> gremlin.hadoop.inputLocation=grateful-dead.txt
> >>> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> >>> gremlin.hadoop.outputLocation=output
> >>>
> >>> # Bulk Loader configuration
> >>> gremlin.bulkLoaderVertexProgram.loader.class=org.apache.tinkerpop.gremlin.process.computer.bulkloading.IncrementalBulkLoader
> >>> gremlin.bulkLoaderVertexProgram.loader.vertexIdProperty=bulkloader.vertex.id
> >>> gremlin.bulkLoaderVertexProgram.loader.userSuppliedIds=false
> >>> gremlin.bulkLoaderVertexProgram.loader.keepOriginalIds=false
> >>> gremlin.bulkLoaderVertexProgram.graph.class=com.thinkaurelius.titan.core.TitanFactory
> >>> gremlin.bulkLoaderVertexProgram.graph.storage.backend=cassandrathrift
> >>> gremlin.bulkLoaderVertexProgram.graph.storage.hostname=127.0.0.1
> >>> gremlin.bulkLoaderVertexProgram.graph.storage.batch-loading=true
> >>> gremlin.bulkLoaderVertexProgram.intermediateBatchSize=10000
> >>>
> >>> spark.master=local[4]
> >>> spark.executor.memory=1g
> >>> spark.serializer=org.apache.spark.serializer.KryoSerializer
> >>>
> >>> *Create the Titan schema*
> >>>
> >>> cd titan
> >>> bin/gremlin.sh
> >>>
> >>> graph = GraphFactory.open("titan-cassandra.properties")
> >>> m = graph.openManagement()
> >>> // vertex labels
> >>> artist = m.makeVertexLabel("artist").make()
> >>> song = m.makeVertexLabel("song").make()
> >>> // edge labels
> >>> sungBy = m.makeEdgeLabel("sungBy").make()
> >>> writtenBy = m.makeEdgeLabel("writtenBy").make()
> >>> followedBy = m.makeEdgeLabel("followedBy").make()
> >>> // vertex and edge properties
> >>> blid = m.makePropertyKey("bulkloader.vertex.id").dataType(Long.class).make()
> >>> name = m.makePropertyKey("name").dataType(String.class).make()
> >>> songType = m.makePropertyKey("songType").dataType(String.class).make()
> >>> performances = m.makePropertyKey("performances").dataType(Integer.class).make()
> >>> weight = m.makePropertyKey("weight").dataType(Integer.class).make()
> >>> // global indices
> >>> m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
> >>> m.buildIndex("artistsByName", Vertex.class).addKey(name).indexOnly(artist).buildCompositeIndex()
> >>> m.buildIndex("songsByName", Vertex.class).addKey(name).indexOnly(song).buildCompositeIndex()
> >>> // vertex-centric indices
> >>> m.buildEdgeIndex(followedBy, "followedByTime", Direction.BOTH, Order.decr, weight)
> >>> m.commit()
> >>> graph.close()
> >>>
> >>> Up to this point it's the usual stuff that we do all day long: create
> >>> configurations, create schemas, mess with Hadoop... All in all,
> >>> nothing special.
> >>>
> >>> *Here comes the new part - start the BulkLoaderVertexProgram*
> >>>
> >>> blgr = GraphFactory.open("hadoop-script.properties")
> >>> blvp = BulkLoaderVertexProgram.build().create(blgr)
> >>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>
> >>> Note that you don't have to have the Bulk Loader configuration
> >>> embedded in your Hadoop graph configuration file; you can also do:
> >>>
> >>> blgr = GraphFactory.open("hadoop-script.properties")
> >>> blvp = BulkLoaderVertexProgram.build().configure(
> >>>     // default values not included
> >>>     "loader.vertexIdProperty", "bulkloader.vertex.id",
> >>>     "loader.keepOriginalIds", false,
> >>>     "graph.class", "com.thinkaurelius.titan.core.TitanFactory",
> >>>     "graph.storage.backend", "cassandrathrift",
> >>>     "graph.storage.hostname", "127.0.0.1",
> >>>     "graph.storage.batch-loading", true,
> >>>     "intermediateBatchSize", 10000
> >>> ).create(blgr)
> >>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>
> >>> ...or simply mix both approaches.
> >>>
> >>> Play around with it and let us know what you think. If we get enough
> >>> positive feedback / no negative feedback, the BulkLoaderVertexProgram
> >>> will make it into the next TinkerPop release (3.0.1) and thus also
> >>> into the next Titan release.
> >>>
> >>> Cheers,
> >>> Daniel
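The loader/graph prefix splitting Daniel describes at the top of the thread (loader.* goes to the BulkLoader, graph.* with the prefix stripped goes to GraphFactory.open()) can be sketched in plain Java. This is a toy illustration of the idea only, not BLVP's actual code; the `subset` helper and class name are invented for the example:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigSubset {

    // Return all entries whose key starts with "<prefix>.", with the prefix stripped.
    static Map<String, String> subset(Map<String, String> config, String prefix) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : config.entrySet()) {
            if (e.getKey().startsWith(prefix + ".")) {
                result.put(e.getKey().substring(prefix.length() + 1), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> blvpConfig = new HashMap<>();
        blvpConfig.put("loader.vertexIdProperty", "bulkloader.vertex.id");
        blvpConfig.put("loader.keepOriginalIds", "false");
        blvpConfig.put("graph.gremlin.graph", "com.thinkaurelius.titan.core.TitanFactory");
        blvpConfig.put("graph.storage.backend", "cassandra");

        // loader.* is handed to the BulkLoader implementation,
        // graph.* (prefix stripped) to GraphFactory.open()
        Map<String, String> loader = subset(blvpConfig, "loader");
        Map<String, String> graph = subset(blvpConfig, "graph");

        System.out.println(loader.get("vertexIdProperty")); // bulkloader.vertex.id
        System.out.println(graph.get("storage.backend"));   // cassandra
    }
}
```

The same prefix-subset pattern is what Apache Commons Configuration's `subset()` provides, which is why flat properties files can drive several independent components from one namespace.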
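The reason BLVP must read from the target graph (and why graphOutputFormat stays NullOutputFormat) is that incremental loading is, in essence, an upsert keyed on the bulkloader.vertex.id property: look the vertex up by its original id, and create it only if it does not exist yet. A minimal sketch of that idea, with a plain HashMap standing in for the target graph and all names invented for illustration (this is not IncrementalBulkLoader's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementalUpsert {

    // Toy stand-in for the target graph: maps bulkloader.vertex.id -> vertex properties.
    final Map<Object, Map<String, Object>> targetGraph = new HashMap<>();

    // Look the vertex up by its original id; create it only if it is absent.
    // This read-before-write step is why BLVP needs a readable target graph
    // and cannot work through a write-only OutputFormat.
    Map<String, Object> getOrCreateVertex(Object originalId) {
        return targetGraph.computeIfAbsent(originalId, id -> {
            Map<String, Object> v = new HashMap<>();
            v.put("bulkloader.vertex.id", id); // persisted so later runs can find it
            return v;
        });
    }

    public static void main(String[] args) {
        IncrementalUpsert loader = new IncrementalUpsert();
        Map<String, Object> v1 = loader.getOrCreateVertex(42L);
        v1.put("name", "Garcia");
        // A second pass with the same source id finds the existing
        // vertex instead of creating a duplicate.
        Map<String, Object> v2 = loader.getOrCreateVertex(42L);
        System.out.println(v1 == v2);                  // true
        System.out.println(loader.targetGraph.size()); // 1
    }
}
```

In the real setup, the lookup is served by the "byBulkLoaderVertexId" composite index created in the schema step above, which is what keeps repeated or resumed bulk loads idempotent.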
