Hi Daniel,

That looks really good. Just for closure, can you show me what your properties file looks like after it gets "unrolled" by BLVP? Even if you fat-finger in a few lines as an example.
Thanks,
Marko.

http://markorodriguez.com

On Aug 31, 2015, at 12:39 PM, Daniel Kuppitz <[email protected]> wrote:

> Okay, most of the confusion came from how I implemented the configuration
> stuff. I tweaked it a bit and here's what we have now compared to my
> initial post:
>
> *hadoop-script.properties* got a lot slimmer:
>
> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
>
> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> gremlin.hadoop.jarsInDistributedCache=true
> gremlin.hadoop.inputLocation=grateful-dead.txt
> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> gremlin.hadoop.outputLocation=output
>
> spark.master=local[4]
> spark.executor.memory=1g
> spark.serializer=org.apache.spark.serializer.KryoSerializer
>
> The builder now provides fluent methods to configure the
> BulkLoaderVertexProgram:
>
> blgr = GraphFactory.open("hadoop-script.properties")
> blvp = BulkLoaderVertexProgram.build().
>            vertexIdProperty("bulkloader.vertex.id").
>            keepOriginalIds(false).
>            writeGraph("titan-cassandra-bulk.properties").
>            intermediateBatchSize(10000).create(blgr)
> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
>
> *titan-cassandra-bulk.properties* looks just like any other Graph
> configuration file you're already familiar with:
>
> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
>
> storage.backend=cassandra
> storage.hostname=127.0.0.1
> storage.batch-loading=true
>
> And one last note regarding the question *"why do we need 2
> configurations? gremlin.hadoop.graphOutputFormat and
> gremlin.bulkLoaderVertexProgram.graph.class"*:
>
> We don't.
> graphOutputFormat can't be used by the BLVP, since we don't only
> write, but also read elements from the target graph. Since a VertexProgram
> doesn't know anything about the OutputFormat, we wouldn't be able to get an
> instance of that graph and thus we wouldn't be able to read from it.
> Consequently the graphOutputFormat can/should always be NullOutputFormat
> for BLVP (any other output format would have the same effect); instead, the
> writeGraph configuration (as shown in my last sample) always has to be
> provided.
>
> Cheers,
> Daniel
>
>
> On Mon, Aug 31, 2015 at 4:34 PM, Marko Rodriguez <[email protected]> wrote:
>
>> Hi Daniel,
>>
>> This is great that we now have bulk loading via the TP3 GraphComputer.
>> However, before this gets merged, it is important that you get your
>> configuration model consistent with the pattern we are using with other
>> VertexPrograms. Problems:
>>
>> 1. gremlin.bulkLoaderVertexProgram.graph.storage.backend
>>    - What does this have to do with Gremlin? This is a Titan
>>      thing. We can not mix this. Titan is not TinkerPop.
>>    - Titan needs its own namespace. Talk to the Titan guys
>>      and see what they are using for namespace conventions too; if you are
>>      committing to Titan's code base, you should follow their pattern (and
>>      not make up your own).
>> 2. gremlin.hadoop.graphOutputFormat
>>    - This is where you specify your output, not in two places,
>>      e.g. -- ? gremlin.bulkLoaderVertexProgram.graph.class. ?
>> 3. Please look at how you are doing your fluent building of
>>    BulkLoaderVertexProgram. Do not use Object[] key values.
>>    - Study the pre-existing vertex programs and follow their
>>      pattern.
>>
>> In general, do your best to follow existing patterns. If everyone has
>> different naming conventions, fluent APIs, etc., TP3 will feel disjoint.
>> Study what has already been created (configurations, fluent APIs,
>> etc.) and use that model.
>>
>> Thanks Daniel,
>> Marko.
>>
>> http://markorodriguez.com
>>
>> On Aug 31, 2015, at 8:06 AM, Daniel Kuppitz <[email protected]> wrote:
>>
>>> Hello TinkerPop devs,
>>>
>>> over the last couple of days we've implemented a BulkLoaderVertexProgram
>>> for TinkerPop3. I know a lot of people are waiting for it and I guess -
>>> once it's released - it won't take very long until Stephen and I continue
>>> the Powers of Ten
>>> <http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/> blog post
>>> series.
>>>
>>> The TinkerPop3 BulkLoaderVertexProgram comes with an IncrementalBulkLoader
>>> implementation that is used by default. However, it's easy to use your own
>>> customized implementation of a bulk loader. The vertex program supports all
>>> the input formats you're already familiar with (GraphSON, Kryo, Script). As
>>> a target graph you can use any graph that supports multiple concurrent
>>> connections (unfortunately that restriction disqualifies Neo4j, as its
>>> current TP3 implementation does not support the HA mode). Let me walk you
>>> through a simple example that loads the Grateful Dead graph into Titan.
>>>
>>> *Prerequisites*
>>>
>>> - TinkerPop3 (development branch: blvp)
>>> - Titan 0.9 (customized build)
>>> - a running Hadoop (pseudo) cluster
>>> - Cassandra 2.1.x (for this particular example, as I'm going to use
>>>   Titan/Cassandra)
>>>
>>> *Build TinkerPop3 from source*
>>>
>>> git clone https://github.com/apache/incubator-tinkerpop.git
>>> cd incubator-tinkerpop
>>> git checkout blvp
>>> mvn clean install -DskipTests
>>>
>>> *Build Titan from source*
>>>
>>> git clone https://github.com/thinkaurelius/titan.git
>>> cd titan
>>> sed 's@<tinkerpop.version>.*</tinkerpop.version>@<tinkerpop.version>3.0.1-SNAPSHOT</tinkerpop.version>@' pom.xml > pom.xml.new
>>> mv pom.xml.new pom.xml
>>> mvn clean install -DskipTests
>>>
>>> *Copy the Grateful Dead files to HDFS*
>>>
>>> cd incubator-tinkerpop
>>> find . -name script-input-grateful-dead.groovy | head -n1 | xargs -I {} hadoop fs -copyFromLocal {} script-input-grateful-dead.groovy
>>> find . -name grateful-dead.txt | head -n1 | xargs -I {} hadoop fs -copyFromLocal {} grateful-dead.txt
>>>
>>> *Create 2 configuration files - one for Titan/Cassandra, one for Hadoop /
>>> for the BulkLoader*
>>>
>>> *titan-cassandra.properties*
>>>
>>> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
>>>
>>> storage.backend=cassandrathrift
>>> storage.hostname=127.0.0.1
>>>
>>> *hadoop-script.properties*
>>>
>>> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
>>>
>>> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
>>> gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
>>> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
>>> gremlin.hadoop.jarsInDistributedCache=true
>>> gremlin.hadoop.inputLocation=grateful-dead.txt
>>> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
>>> gremlin.hadoop.outputLocation=output
>>>
>>> # Bulk Loader configuration
>>> gremlin.bulkLoaderVertexProgram.loader.class=org.apache.tinkerpop.gremlin.process.computer.bulkloading.IncrementalBulkLoader
>>> gremlin.bulkLoaderVertexProgram.loader.vertexIdProperty=bulkloader.vertex.id
>>> gremlin.bulkLoaderVertexProgram.loader.userSuppliedIds=false
>>> gremlin.bulkLoaderVertexProgram.loader.keepOriginalIds=false
>>> gremlin.bulkLoaderVertexProgram.graph.class=com.thinkaurelius.titan.core.TitanFactory
>>> gremlin.bulkLoaderVertexProgram.graph.storage.backend=cassandrathrift
>>> gremlin.bulkLoaderVertexProgram.graph.storage.hostname=127.0.0.1
>>> gremlin.bulkLoaderVertexProgram.graph.storage.batch-loading=true
>>> gremlin.bulkLoaderVertexProgram.intermediateBatchSize=10000
>>>
>>> spark.master=local[4]
>>> spark.executor.memory=1g
>>> spark.serializer=org.apache.spark.serializer.KryoSerializer
>>>
>>> *Create the Titan schema*
>>>
>>> cd titan
>>> bin/gremlin.sh
>>>
>>> graph = GraphFactory.open("titan-cassandra.properties")
>>> m = graph.openManagement()
>>> // vertex labels
>>> artist = m.makeVertexLabel("artist").make()
>>> song = m.makeVertexLabel("song").make()
>>> // edge labels
>>> sungBy = m.makeEdgeLabel("sungBy").make()
>>> writtenBy = m.makeEdgeLabel("writtenBy").make()
>>> followedBy = m.makeEdgeLabel("followedBy").make()
>>> // vertex and edge properties
>>> blid = m.makePropertyKey("bulkloader.vertex.id").dataType(Long.class).make()
>>> name = m.makePropertyKey("name").dataType(String.class).make()
>>> songType = m.makePropertyKey("songType").dataType(String.class).make()
>>> performances = m.makePropertyKey("performances").dataType(Integer.class).make()
>>> weight = m.makePropertyKey("weight").dataType(Integer.class).make()
>>> // global indices
>>> m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
>>> m.buildIndex("artistsByName", Vertex.class).addKey(name).indexOnly(artist).buildCompositeIndex()
>>> m.buildIndex("songsByName", Vertex.class).addKey(name).indexOnly(song).buildCompositeIndex()
>>> // vertex centric indices
>>> m.buildEdgeIndex(followedBy, "followedByTime", Direction.BOTH, Order.decr, weight)
>>> m.commit()
>>> graph.close()
>>>
>>> Up to this point it's the usual stuff that we do all day long: create
>>> configurations, create schemas, mess with Hadoop... All in all, nothing
>>> special.
>>>
>>> *Here comes the new part - start the BulkLoaderVertexProgram*
>>>
>>> blgr = GraphFactory.open("hadoop-script.properties")
>>> blvp = BulkLoaderVertexProgram.build().create(blgr)
>>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
>>>
>>> Note that you don't have to have the Bulk Loader configuration embedded in
>>> your Hadoop graph configuration file; you can also do:
>>>
>>> blgr = GraphFactory.open("hadoop-script.properties")
>>> blvp = BulkLoaderVertexProgram.build().configure(
>>>     // default values not included
>>>     "loader.vertexIdProperty", "bulkloader.vertex.id",
>>>     "loader.keepOriginalIds", false,
>>>     "graph.class", "com.thinkaurelius.titan.core.TitanFactory",
>>>     "graph.storage.backend", "cassandrathrift",
>>>     "graph.storage.hostname", "127.0.0.1",
>>>     "graph.storage.batch-loading", true,
>>>     "intermediateBatchSize", 10000
>>> ).create(blgr)
>>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
>>>
>>> ...or simply mix both approaches.
>>>
>>> Play around with it and let us know what you think. If we get enough
>>> positive feedback / no negative feedback, the BulkLoaderVertexProgram will
>>> make it into the next TinkerPop release (3.0.1) and thus also into the next
>>> Titan release.
>>>
>>> Cheers,
>>> Daniel
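The fluent builder calls shown in Daniel's later reply map one-to-one onto configuration keys, so a sketch of what Marko's requested "unrolled" properties might look like can be pieced together from the loader keys that appear verbatim in the embedded hadoop-script.properties above. This is a hedged reconstruction, not output from BLVP itself, and the writeGraph key name in particular is an assumption:

```
# hypothetical "unrolled" BLVP configuration (reconstructed from keys in this thread;
# gremlin.bulkLoaderVertexProgram.writeGraph is a guessed key name)
gremlin.bulkLoaderVertexProgram.loader.vertexIdProperty=bulkloader.vertex.id
gremlin.bulkLoaderVertexProgram.loader.keepOriginalIds=false
gremlin.bulkLoaderVertexProgram.writeGraph=titan-cassandra-bulk.properties
gremlin.bulkLoaderVertexProgram.intermediateBatchSize=10000
```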
