Done done. Functionality again validated by several manual tests. Cheers, Daniel
On Mon, Aug 31, 2015 at 9:36 PM, Marko Rodriguez <[email protected]> wrote:

> Thanks.
>
> Can you have the builder's fluent methods be namespaced to
> bulkLoaderVertexProgram, please:
>
> gremlin.bulkLoaderVertexProgram.vertexIdProperty=
> gremlin.bulkLoaderVertexProgram.keepOriginalIds=
>
> Next, I would make your writeGraph part prefixed accordingly. You simply
> do "graph". Is that sufficient? HadoopGraph and TitanFactory are going to
> be vying for that namespace. Perhaps, given that you are nicely loading
> this from a preexisting configuration file, just use the
> gremlin.bulkLoaderVertexProgram namespace again:
>
> gremlin.bulkLoaderVertexProgram.writeGraph.graph
> gremlin.bulkLoaderVertexProgram.writeGraph.storage.backend
> gremlin.bulkLoaderVertexProgram.writeGraph.storage.hostname
> …
> etc.
>
> Thanks,
> Marko.
>
> http://markorodriguez.com
>
> On Aug 31, 2015, at 1:11 PM, Daniel Kuppitz <[email protected]> wrote:
>
> > Under the hood the last example generates this configuration:
> >
> > *# from hadoop-script.properties:*
> > gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> > gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> > gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
> > gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> > gremlin.hadoop.jarsInDistributedCache=true
> > gremlin.hadoop.inputLocation=grateful-dead.txt
> > gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> > gremlin.hadoop.outputLocation=output
> > spark.master=local[4]
> > spark.executor.memory=1g
> > spark.serializer=org.apache.spark.serializer.KryoSerializer
> >
> > *# from the builder's fluent methods:*
> > loader.vertexIdProperty="bulkloader.vertex.id"
> > loader.keepOriginalIds=false
> > loader.intermediateBatchSize=10000
> >
> > *# from the writeGraph method:*
> > graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> > graph.storage.backend=cassandra
> > graph.storage.hostname=127.0.0.1
> > graph.storage.batch-loading=true
> >
> > What BulkLoaderVertexProgram ultimately gets from the Builder is:
> >
> > loader.vertexIdProperty="bulkloader.vertex.id"
> > loader.keepOriginalIds=false
> > loader.intermediateBatchSize=10000
> > graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> > graph.storage.backend=cassandra
> > graph.storage.hostname=127.0.0.1
> > graph.storage.batch-loading=true
> >
> > The loader subset is then passed to the BulkLoader implementation; the
> > graph subset is passed to GraphFactory.open().
> >
> > Cheers,
> > Daniel
> >
> > On Mon, Aug 31, 2015 at 8:55 PM, Marko Rodriguez <[email protected]> wrote:
> >
> >> Hi Daniel,
> >>
> >> That looks really good. Just for closure, can you show me what your
> >> properties file looks like after it gets "unrolled" by BLVP? Even if you
> >> fat finger in a few lines for example.
> >>
> >> Thanks,
> >> Marko.
> >>
> >> http://markorodriguez.com
> >>
> >> On Aug 31, 2015, at 12:39 PM, Daniel Kuppitz <[email protected]> wrote:
> >>
> >>> Okay, most of the confusion came from how I implemented the
> >>> configuration stuff.
> >>> I tweaked it a bit and here's what we have now compared to my
> >>> initial post:
> >>>
> >>> *hadoop-script.properties* got a lot slimmer:
> >>>
> >>> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> >>>
> >>> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> >>> gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
> >>> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> >>> gremlin.hadoop.jarsInDistributedCache=true
> >>> gremlin.hadoop.inputLocation=grateful-dead.txt
> >>> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> >>> gremlin.hadoop.outputLocation=output
> >>>
> >>> spark.master=local[4]
> >>> spark.executor.memory=1g
> >>> spark.serializer=org.apache.spark.serializer.KryoSerializer
> >>>
> >>> The builder now provides fluent methods to configure the
> >>> BulkLoaderVertexProgram:
> >>>
> >>> blgr = GraphFactory.open("hadoop-script.properties")
> >>> blvp = BulkLoaderVertexProgram.build().
> >>>            vertexIdProperty("bulkloader.vertex.id").
> >>>            keepOriginalIds(false).
> >>>            writeGraph("titan-cassandra-bulk.properties").
> >>>            intermediateBatchSize(10000).create(blgr)
> >>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>
> >>> *titan-cassandra-bulk.properties* looks just like any other Graph
> >>> configuration file you're already familiar with:
> >>>
> >>> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> >>>
> >>> storage.backend=cassandra
> >>> storage.hostname=127.0.0.1
> >>> storage.batch-loading=true
> >>>
> >>> And one last note regarding the question *"why do we need 2
> >>> configurations? gremlin.hadoop.graphOutputFormat and
> >>> gremlin.bulkLoaderVertexProgram.graph.class"*:
> >>>
> >>> We don't.
graphOutputFormat can't be used by the BLVP, since we don't only
> >>> write, but also read elements from the target graph. Since a
> >>> VertexProgram doesn't know anything about the OutputFormat, we
> >>> wouldn't be able to get an instance of that graph and thus we
> >>> wouldn't be able to read from it. Consequently the graphOutputFormat
> >>> can/should always be NullOutputFormat for BLVP (any other output
> >>> format would have the same effect); instead, the writeGraph
> >>> configuration (as shown in my last sample) always has to be provided.
> >>>
> >>> Cheers,
> >>> Daniel
> >>>
> >>> On Mon, Aug 31, 2015 at 4:34 PM, Marko Rodriguez <[email protected]> wrote:
> >>>
> >>>> Hi Daniel,
> >>>>
> >>>> This is great that we now have bulk loading via TP3 GraphComputer.
> >>>> However, before this gets merged, it is important that you get your
> >>>> configuration model consistent with the pattern we are using with
> >>>> other VertexPrograms. Problems:
> >>>>
> >>>> 1. gremlin.bulkLoaderVertexProgram.graph.storage.backend
> >>>>        - What does this have to do with Gremlin? This is a Titan
> >>>>          thing. We can not mix this. Titan is not TinkerPop.
> >>>>        - Titan needs its own namespace. Talk to the Titan guys and
> >>>>          see what they are using for namespace conventions too, as
> >>>>          if you are committing to Titan's code base, you should
> >>>>          follow their pattern (and not make up your own).
> >>>> 2. gremlin.hadoop.graphOutputFormat
> >>>>        - This is where you specify your output, not in two places
> >>>>          e.g. -- ? gremlin.bulkLoaderVertexProgram.graph.class. ?
> >>>> 3. Please look at how you are doing your fluent building of
> >>>>    BulkLoaderVertexProgram. Do not use Object[] key values.
> >>>>        - Study the pre-existing vertex programs and follow their
> >>>>          pattern.
> >>>>
> >>>> In general, do your best to follow existing patterns. If everyone
> >>>> has different naming conventions, fluent APIs, etc.,
TP3 will feel disjoint. Study what has already been created
> >>>> (configurations, fluent APIs, etc.) and use that model.
> >>>>
> >>>> Thanks Daniel,
> >>>> Marko.
> >>>>
> >>>> http://markorodriguez.com
> >>>>
> >>>> On Aug 31, 2015, at 8:06 AM, Daniel Kuppitz <[email protected]> wrote:
> >>>>
> >>>>> Hello TinkerPop devs,
> >>>>>
> >>>>> over the last couple of days we've implemented a BulkLoaderVertexProgram
> >>>>> for TinkerPop3. I know a lot of people are waiting for it and I guess -
> >>>>> once it's released - it won't take very long until Stephen and I continue
> >>>>> the Powers of Ten
> >>>>> <http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/> blog post
> >>>>> series.
> >>>>>
> >>>>> The TinkerPop3 BulkLoaderVertexProgram comes with an IncrementalBulkLoader
> >>>>> implementation that is used by default. However, it's easy to use your own
> >>>>> customized implementation of a bulk loader. The vertex program supports all
> >>>>> the input formats you're already familiar with (GraphSON, Kryo, Script). As
> >>>>> a target graph you can use any graph that supports multiple concurrent
> >>>>> connections (unfortunately that restriction disqualifies Neo4j, as its
> >>>>> current TP3 implementation does not support the HA mode). Let me walk you
> >>>>> through a simple example that loads the Grateful Dead graph into Titan.
> >>>>>
> >>>>> *Prerequisites*
> >>>>>
> >>>>> - TinkerPop3 (development branch: blvp)
> >>>>> - Titan 0.9 (customized build)
> >>>>> - a running Hadoop (pseudo) cluster
> >>>>> - Cassandra 2.1.x (for this particular example, as I'm going to use
> >>>>>   Titan/Cassandra)
> >>>>>
> >>>>> *Build TinkerPop3 from source*
> >>>>>
> >>>>> git clone https://github.com/apache/incubator-tinkerpop.git
> >>>>> cd incubator-tinkerpop
> >>>>> git checkout blvp
> >>>>> mvn clean install -DskipTests
> >>>>>
> >>>>> *Build Titan from source*
> >>>>>
> >>>>> git clone https://github.com/thinkaurelius/titan.git
> >>>>> cd titan
> >>>>> sed 's@<tinkerpop.version>.*</tinkerpop.version>@<tinkerpop.version>3.0.1-SNAPSHOT</tinkerpop.version>@' pom.xml > pom.xml.new
> >>>>> mv pom.xml.new pom.xml
> >>>>> mvn clean install -DskipTests
> >>>>>
> >>>>> *Copy the Grateful Dead files to HDFS*
> >>>>>
> >>>>> cd incubator-tinkerpop
> >>>>> find . -name script-input-grateful-dead.groovy | head -n1 | xargs -I {}
> >>>>>   hadoop fs -copyFromLocal {} script-input-grateful-dead.groovy
> >>>>> find .
  -name grateful-dead.txt | head -n1 | xargs -I {}
> >>>>>   hadoop fs -copyFromLocal {} grateful-dead.txt
> >>>>>
> >>>>> *Create 2 configuration files - one for Titan/Cassandra, one for
> >>>>> Hadoop / for the BulkLoader*
> >>>>>
> >>>>> *titan-cassandra.properties*
> >>>>>
> >>>>> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> >>>>>
> >>>>> storage.backend=cassandrathrift
> >>>>> storage.hostname=127.0.0.1
> >>>>>
> >>>>> *hadoop-script.properties*
> >>>>>
> >>>>> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> >>>>>
> >>>>> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> >>>>> gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
> >>>>> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> >>>>> gremlin.hadoop.jarsInDistributedCache=true
> >>>>> gremlin.hadoop.inputLocation=grateful-dead.txt
> >>>>> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> >>>>> gremlin.hadoop.outputLocation=output
> >>>>>
> >>>>> # Bulk Loader configuration
> >>>>> gremlin.bulkLoaderVertexProgram.loader.class=org.apache.tinkerpop.gremlin.process.computer.bulkloading.IncrementalBulkLoader
> >>>>> gremlin.bulkLoaderVertexProgram.loader.vertexIdProperty=bulkloader.vertex.id
> >>>>> gremlin.bulkLoaderVertexProgram.loader.userSuppliedIds=false
> >>>>> gremlin.bulkLoaderVertexProgram.loader.keepOriginalIds=false
> >>>>> gremlin.bulkLoaderVertexProgram.graph.class=com.thinkaurelius.titan.core.TitanFactory
> >>>>> gremlin.bulkLoaderVertexProgram.graph.storage.backend=cassandrathrift
> >>>>> gremlin.bulkLoaderVertexProgram.graph.storage.hostname=127.0.0.1
> >>>>> gremlin.bulkLoaderVertexProgram.graph.storage.batch-loading=true
> >>>>> gremlin.bulkLoaderVertexProgram.intermediateBatchSize=10000
> >>>>>
> >>>>> spark.master=local[4]
> >>>>> spark.executor.memory=1g
> >>>>> spark.serializer=org.apache.spark.serializer.KryoSerializer
> >>>>>
> >>>>> *Create the Titan schema*
> >>>>>
> >>>>> cd titan
> >>>>> bin/gremlin.sh
> >>>>>
> >>>>> graph = GraphFactory.open("titan-cassandra.properties")
> >>>>> m = graph.openManagement()
> >>>>> // vertex labels
> >>>>> artist = m.makeVertexLabel("artist").make()
> >>>>> song = m.makeVertexLabel("song").make()
> >>>>> // edge labels
> >>>>> sungBy = m.makeEdgeLabel("sungBy").make()
> >>>>> writtenBy = m.makeEdgeLabel("writtenBy").make()
> >>>>> followedBy = m.makeEdgeLabel("followedBy").make()
> >>>>> // vertex and edge properties
> >>>>> blid = m.makePropertyKey("bulkloader.vertex.id").dataType(Long.class).make()
> >>>>> name = m.makePropertyKey("name").dataType(String.class).make()
> >>>>> songType = m.makePropertyKey("songType").dataType(String.class).make()
> >>>>> performances = m.makePropertyKey("performances").dataType(Integer.class).make()
> >>>>> weight = m.makePropertyKey("weight").dataType(Integer.class).make()
> >>>>> // global indices
> >>>>> m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
> >>>>> m.buildIndex("artistsByName", Vertex.class).addKey(name).indexOnly(artist).buildCompositeIndex()
> >>>>> m.buildIndex("songsByName", Vertex.class).addKey(name).indexOnly(song).buildCompositeIndex()
> >>>>> // vertex centric indices
> >>>>> m.buildEdgeIndex(followedBy, "followedByTime", Direction.BOTH, Order.decr, weight)
> >>>>> m.commit()
> >>>>> graph.close()
> >>>>>
> >>>>> Up to this point it's the usual stuff that we do all day long: create
> >>>>> configurations, create schemas, mess with Hadoop... All in all nothing
> >>>>> special.
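[Editor's note: the byBulkLoaderVertexId index above exists so the incremental loader can look up previously written vertices by the source id stored in the bulkloader.vertex.id property. A minimal sketch of that get-or-create idea in plain Java, with a Map standing in for the graph index — the names here are illustrative, not BLVP's actual API:]

```java
import java.util.HashMap;
import java.util.Map;

public class GetOrCreateSketch {
    // Stand-in for the "byBulkLoaderVertexId" index: source vertex id -> target vertex id.
    static final Map<Long, String> index = new HashMap<>();
    static long nextId = 0;

    // Return the already-loaded target vertex for a source id, or create a new one.
    // Running the load twice therefore does not duplicate vertices.
    static String getOrCreateVertex(long sourceId) {
        return index.computeIfAbsent(sourceId, id -> "v" + (nextId++));
    }

    public static void main(String[] args) {
        String first = getOrCreateVertex(42L);   // creates a new vertex
        String second = getOrCreateVertex(42L);  // finds the same vertex again
        System.out.println(first.equals(second)); // true: the load is idempotent
    }
}
```

This is why the index must exist before the bulk load starts: without it, every lookup by bulkloader.vertex.id would be a full scan.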
> >>>>> *Here comes the new part - start the BulkLoaderVertexProgram*
> >>>>>
> >>>>> blgr = GraphFactory.open("hadoop-script.properties")
> >>>>> blvp = BulkLoaderVertexProgram.build().create(blgr)
> >>>>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>>>
> >>>>> Note that you don't have to have the Bulk Loader configuration embedded
> >>>>> in your Hadoop graph configuration file; you can also do:
> >>>>>
> >>>>> blgr = GraphFactory.open("hadoop-script.properties")
> >>>>> blvp = BulkLoaderVertexProgram.build().configure(
> >>>>>            // default values not included
> >>>>>            "loader.vertexIdProperty", "bulkloader.vertex.id",
> >>>>>            "loader.keepOriginalIds", false,
> >>>>>            "graph.class", "com.thinkaurelius.titan.core.TitanFactory",
> >>>>>            "graph.storage.backend", "cassandrathrift",
> >>>>>            "graph.storage.hostname", "127.0.0.1",
> >>>>>            "graph.storage.batch-loading", true,
> >>>>>            "intermediateBatchSize", 10000
> >>>>>        ).create(blgr)
> >>>>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>>>
> >>>>> ...or simply mix both approaches.
> >>>>>
> >>>>> Play around with it and let us know what you think. If we get enough
> >>>>> positive feedback / no negative feedback, the BulkLoaderVertexProgram
> >>>>> will make it into the next TinkerPop release (3.0.1) and thus also into
> >>>>> the next Titan release.
> >>>>>
> >>>>> Cheers,
> >>>>> Daniel
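[Editor's note: the configuration unrolling agreed on earlier in the thread — the loader.* subset goes to the BulkLoader implementation, the graph.* subset to GraphFactory.open() — boils down to splitting a flat key/value configuration by prefix. A minimal sketch in plain Java, using java.util.Properties as a stand-in for Commons Configuration's subset(); this is not the actual BLVP code:]

```java
import java.util.Properties;

public class ConfigSplitSketch {
    // Return the entries of `props` whose keys start with `prefix + "."`,
    // with the prefix stripped off (analogous to Commons Configuration's subset()).
    static Properties subset(Properties props, String prefix) {
        Properties result = new Properties();
        String p = prefix + ".";
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(p)) {
                result.setProperty(key.substring(p.length()), props.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The "unrolled" configuration from the thread, as flat key/value pairs.
        Properties cfg = new Properties();
        cfg.setProperty("loader.keepOriginalIds", "false");
        cfg.setProperty("graph.gremlin.graph", "com.thinkaurelius.titan.core.TitanFactory");
        cfg.setProperty("graph.storage.backend", "cassandra");

        // The graph.* subset is what would be handed to GraphFactory.open(...),
        // the loader.* subset to the BulkLoader implementation.
        Properties graphCfg = subset(cfg, "graph");
        System.out.println(graphCfg.getProperty("storage.backend")); // cassandra
    }
}
```

The prefix convention is what lets one properties file carry Hadoop, Spark, loader, and target-graph settings side by side without collisions.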
