Thanks. Can you have the builder's fluent methods be namespaced to bulkLoaderVertexProgram, please.
gremlin.bulkLoaderVertexProgram.vertexIdProperty=
gremlin.bulkLoaderVertexProgram.keepOriginalIds=

Next, I would make your writeGraph part prefixed accordingly. You simply do "graph". Is that sufficient? HadoopGraph and TitanFactory are going to be vying for that namespace. Perhaps, given that you are nicely loading this from a preexisting configuration file, just use the gremlin.bulkLoaderVertexProgram namespace again:

gremlin.bulkLoaderVertexProgram.writeGraph.graph
gremlin.bulkLoaderVertexProgram.writeGraph.storage.backend
gremlin.bulkLoaderVertexProgram.writeGraph.storage.hostname
... etc.

Thanks,
Marko.

http://markorodriguez.com

On Aug 31, 2015, at 1:11 PM, Daniel Kuppitz <[email protected]> wrote:

> Under the hood the last example generates this configuration:
>
> *# from hadoop-script.properties:*
> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> gremlin.hadoop.jarsInDistributedCache=true
> gremlin.hadoop.inputLocation=grateful-dead.txt
> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> gremlin.hadoop.outputLocation=output
> spark.master=local[4]
> spark.executor.memory=1g
> spark.serializer=org.apache.spark.serializer.KryoSerializer
>
> *# from the builder's fluent methods:*
> loader.vertexIdProperty="bulkloader.vertex.id"
> loader.keepOriginalIds=false
> loader.intermediateBatchSize=10000
>
> *# from the writeGraph method:*
> graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> graph.storage.backend=cassandra
> graph.storage.hostname=127.0.0.1
> graph.storage.batch-loading=true
>
> What BulkLoaderVertexProgram ultimately gets from the Builder is:
>
> loader.vertexIdProperty="bulkloader.vertex.id"
> loader.keepOriginalIds=false
> loader.intermediateBatchSize=10000
> graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> graph.storage.backend=cassandra
> graph.storage.hostname=127.0.0.1
> graph.storage.batch-loading=true
>
> The loader subset is then passed to the BulkLoader implementation, the graph
> subset is passed to GraphFactory.open().
>
> Cheers,
> Daniel
>
> On Mon, Aug 31, 2015 at 8:55 PM, Marko Rodriguez <[email protected]>
> wrote:
>
>> Hi Daniel,
>>
>> That looks really good. Just for closure, can you show me what your
>> properties file looks like after it gets "unrolled" by BLVP? Even if you
>> fat-finger in a few lines for example.
>>
>> Thanks,
>> Marko.
>>
>> http://markorodriguez.com
>>
>> On Aug 31, 2015, at 12:39 PM, Daniel Kuppitz <[email protected]> wrote:
>>
>>> Okay, most of the confusion came from how I implemented the configuration
>>> stuff. I tweaked it a bit and here's what we have now compared to my
>>> initial post:
>>>
>>> *hadoop-script.properties* got a lot slimmer:
>>>
>>> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
>>> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
>>> gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
>>> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
>>> gremlin.hadoop.jarsInDistributedCache=true
>>> gremlin.hadoop.inputLocation=grateful-dead.txt
>>> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
>>> gremlin.hadoop.outputLocation=output
>>>
>>> spark.master=local[4]
>>> spark.executor.memory=1g
>>> spark.serializer=org.apache.spark.serializer.KryoSerializer
>>>
>>> The builder now provides fluent methods to configure the
>>> BulkLoaderVertexProgram:
>>>
>>> blgr = GraphFactory.open("hadoop-script.properties")
>>> blvp = BulkLoaderVertexProgram.build().
>>>            vertexIdProperty("bulkloader.vertex.id").
>>>            keepOriginalIds(false).
>>>            writeGraph("titan-cassandra-bulk.properties").
>>>            intermediateBatchSize(10000).create(blgr)
>>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
>>>
>>> titan-cassandra-bulk.properties looks just like any other Graph
>>> configuration file you're already familiar with:
>>>
>>> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
>>>
>>> storage.backend=cassandra
>>> storage.hostname=127.0.0.1
>>> storage.batch-loading=true
>>>
>>> And one last note regarding the question *"why do we need 2
>>> configurations? gremlin.hadoop.graphOutputFormat
>>> and gremlin.bulkLoaderVertexProgram.graph.class"*:
>>>
>>> We don't. graphOutputFormat can't be used by the BLVP, since we don't only
>>> write, but also read elements from the target graph. Since a VertexProgram
>>> doesn't know anything about the OutputFormat, we wouldn't be able to get an
>>> instance of that graph and thus we wouldn't be able to read from it.
>>> Consequently, the graphOutputFormat can/should always be NullOutputFormat
>>> for BLVP (any other output format would have the same effect); instead, the
>>> writeGraph configuration (as shown in my last sample) always has to be
>>> provided.
>>>
>>> Cheers,
>>> Daniel
>>>
>>> On Mon, Aug 31, 2015 at 4:34 PM, Marko Rodriguez <[email protected]>
>>> wrote:
>>>
>>>> Hi Daniel,
>>>>
>>>> It is great that we now have bulk loading via TP3 GraphComputer.
>>>> However, before this gets merged, it is important that you get your
>>>> configuration model consistent with the pattern we are using with other
>>>> VertexPrograms. Problems:
>>>>
>>>> 1. gremlin.bulkLoaderVertexProgram.graph.storage.backend
>>>>      - What does this have to do with Gremlin? This is a Titan
>>>>        thing. We can not mix this. Titan is not TinkerPop.
>>>>      - Titan needs its own namespace.
>>>>        Talk to the Titan guys
>>>>        and see what they are using for namespace conventions, too. If you are
>>>>        committing to Titan's code base, you should follow their pattern (and not
>>>>        make up your own).
>>>> 2. gremlin.hadoop.graphOutputFormat
>>>>      - This is where you specify your output, not in two places,
>>>>        e.g. -- ? gremlin.bulkLoaderVertexProgram.graph.class ?
>>>> 3. Please look at how you are doing your fluent building of
>>>>    BulkLoaderVertexProgram. Do not use Object[] key values.
>>>>      - Study the pre-existing vertex programs and follow their
>>>>        pattern.
>>>>
>>>> In general, do your best to follow existing patterns. If everyone has
>>>> different naming conventions, fluent APIs, etc., that will make TP3 feel
>>>> disjoint. Study what has already been created (configurations, fluent APIs,
>>>> etc.) and use that model.
>>>>
>>>> Thanks Daniel,
>>>> Marko.
>>>>
>>>> http://markorodriguez.com
>>>>
>>>> On Aug 31, 2015, at 8:06 AM, Daniel Kuppitz <[email protected]> wrote:
>>>>
>>>>> Hello TinkerPop devs,
>>>>>
>>>>> over the last couple of days we've implemented a BulkLoaderVertexProgram
>>>>> for TinkerPop3. I know a lot of people are waiting for it and I guess -
>>>>> once it's released - it won't take very long until Stephen and I continue
>>>>> the Powers of Ten
>>>>> <http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/> blog post
>>>>> series.
>>>>>
>>>>> The TinkerPop3 BulkLoaderVertexProgram comes with an IncrementalBulkLoader
>>>>> implementation that is used by default. However, it's easy to use your own
>>>>> customized implementation of a bulk loader. The vertex program supports all
>>>>> the input formats you're already familiar with (GraphSON, Kryo, Script). As
>>>>> a target graph you can use any graph that supports multiple concurrent
>>>>> connections (unfortunately, that restriction disqualifies Neo4j, as its
>>>>> current TP3 implementation does not support the HA mode).
>>>>> Let me walk you
>>>>> through a simple example that loads the Grateful Dead graph into Titan.
>>>>>
>>>>> *Prerequisites*
>>>>>
>>>>> - TinkerPop3 (development branch: blvp)
>>>>> - Titan 0.9 (customized build)
>>>>> - a running Hadoop (pseudo) cluster
>>>>> - Cassandra 2.1.x (for this particular example, as I'm going to use
>>>>>   Titan/Cassandra)
>>>>>
>>>>> *Build TinkerPop3 from source*
>>>>>
>>>>> git clone https://github.com/apache/incubator-tinkerpop.git
>>>>> cd incubator-tinkerpop
>>>>> git checkout blvp
>>>>> mvn clean install -DskipTests
>>>>>
>>>>> *Build Titan from source*
>>>>>
>>>>> git clone https://github.com/thinkaurelius/titan.git
>>>>> cd titan
>>>>> sed 's@<tinkerpop.version>.*</tinkerpop.version>@<tinkerpop.version>3.0.1-SNAPSHOT</tinkerpop.version>@' pom.xml > pom.xml.new
>>>>> mv pom.xml.new pom.xml
>>>>> mvn clean install -DskipTests
>>>>>
>>>>> *Copy the Grateful Dead files to HDFS*
>>>>>
>>>>> cd incubator-tinkerpop
>>>>> find . -name script-input-grateful-dead.groovy | head -n1 | xargs -I {}
>>>>> hadoop fs -copyFromLocal {} script-input-grateful-dead.groovy
>>>>> find .
>>>>> -name grateful-dead.txt | head -n1 | xargs -I {} hadoop fs
>>>>> -copyFromLocal {} grateful-dead.txt
>>>>>
>>>>> *Create two configuration files - one for Titan/Cassandra, one for
>>>>> Hadoop / the BulkLoader*
>>>>>
>>>>> *titan-cassandra.properties*
>>>>>
>>>>> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
>>>>>
>>>>> storage.backend=cassandrathrift
>>>>> storage.hostname=127.0.0.1
>>>>>
>>>>> *hadoop-script.properties*
>>>>>
>>>>> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
>>>>> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
>>>>> gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
>>>>> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
>>>>> gremlin.hadoop.jarsInDistributedCache=true
>>>>> gremlin.hadoop.inputLocation=grateful-dead.txt
>>>>> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
>>>>> gremlin.hadoop.outputLocation=output
>>>>>
>>>>> # Bulk Loader configuration
>>>>> gremlin.bulkLoaderVertexProgram.loader.class=org.apache.tinkerpop.gremlin.process.computer.bulkloading.IncrementalBulkLoader
>>>>> gremlin.bulkLoaderVertexProgram.loader.vertexIdProperty=bulkloader.vertex.id
>>>>> gremlin.bulkLoaderVertexProgram.loader.userSuppliedIds=false
>>>>> gremlin.bulkLoaderVertexProgram.loader.keepOriginalIds=false
>>>>> gremlin.bulkLoaderVertexProgram.graph.class=com.thinkaurelius.titan.core.TitanFactory
>>>>> gremlin.bulkLoaderVertexProgram.graph.storage.backend=cassandrathrift
>>>>> gremlin.bulkLoaderVertexProgram.graph.storage.hostname=127.0.0.1
>>>>> gremlin.bulkLoaderVertexProgram.graph.storage.batch-loading=true
>>>>> gremlin.bulkLoaderVertexProgram.intermediateBatchSize=10000
>>>>>
>>>>> spark.master=local[4]
>>>>> spark.executor.memory=1g
>>>>> spark.serializer=org.apache.spark.serializer.KryoSerializer
>>>>>
>>>>> *Create the Titan schema*
>>>>>
>>>>> cd titan
>>>>> bin/gremlin.sh
>>>>>
>>>>> graph = GraphFactory.open("titan-cassandra.properties")
>>>>> m = graph.openManagement()
>>>>> // vertex labels
>>>>> artist = m.makeVertexLabel("artist").make()
>>>>> song = m.makeVertexLabel("song").make()
>>>>> // edge labels
>>>>> sungBy = m.makeEdgeLabel("sungBy").make()
>>>>> writtenBy = m.makeEdgeLabel("writtenBy").make()
>>>>> followedBy = m.makeEdgeLabel("followedBy").make()
>>>>> // vertex and edge properties
>>>>> blid = m.makePropertyKey("bulkloader.vertex.id").dataType(Long.class).make()
>>>>> name = m.makePropertyKey("name").dataType(String.class).make()
>>>>> songType = m.makePropertyKey("songType").dataType(String.class).make()
>>>>> performances = m.makePropertyKey("performances").dataType(Integer.class).make()
>>>>> weight = m.makePropertyKey("weight").dataType(Integer.class).make()
>>>>> // global indices
>>>>> m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
>>>>> m.buildIndex("artistsByName", Vertex.class).addKey(name).indexOnly(artist).buildCompositeIndex()
>>>>> m.buildIndex("songsByName", Vertex.class).addKey(name).indexOnly(song).buildCompositeIndex()
>>>>> // vertex-centric indices
>>>>> m.buildEdgeIndex(followedBy, "followedByTime", Direction.BOTH, Order.decr, weight)
>>>>> m.commit()
>>>>> graph.close()
>>>>>
>>>>> Up to this point it's the usual stuff that we do all day long: create
>>>>> configurations, create schemas, mess with Hadoop... All in all, nothing
>>>>> special.
>>>>>
>>>>> *Here comes the new part - start the BulkLoaderVertexProgram*
>>>>>
>>>>> blgr = GraphFactory.open("hadoop-script.properties")
>>>>> blvp = BulkLoaderVertexProgram.build().create(blgr)
>>>>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
>>>>>
>>>>> Note that you don't have to have the Bulk Loader configuration embedded in
>>>>> your Hadoop graph configuration file; you can also do:
>>>>>
>>>>> blgr = GraphFactory.open("hadoop-script.properties")
>>>>> blvp = BulkLoaderVertexProgram.build().configure(
>>>>>     // default values not included
>>>>>     "loader.vertexIdProperty", "bulkloader.vertex.id",
>>>>>     "loader.keepOriginalIds", false,
>>>>>     "graph.class", "com.thinkaurelius.titan.core.TitanFactory",
>>>>>     "graph.storage.backend", "cassandrathrift",
>>>>>     "graph.storage.hostname", "127.0.0.1",
>>>>>     "graph.storage.batch-loading", true,
>>>>>     "intermediateBatchSize", 10000
>>>>> ).create(blgr)
>>>>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
>>>>>
>>>>> ...or simply mix both approaches.
>>>>>
>>>>> Play around with it and let us know what you think. If we get enough
>>>>> positive feedback / no negative feedback, the BulkLoaderVertexProgram will
>>>>> make it into the next TinkerPop release (3.0.1) and thus also into the next
>>>>> Titan release.
>>>>>
>>>>> Cheers,
>>>>> Daniel
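
The subset-splitting Daniel describes earlier in the thread ("the loader subset is then passed to the BulkLoader implementation, the graph subset is passed to GraphFactory.open()") amounts to prefix-based filtering of a flat key/value configuration. Here is a minimal sketch in plain Java; the class name SubsetDemo and the Map-based approach are illustrative only, since the actual implementation would more likely use something like Commons Configuration's subset():

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SubsetDemo {

    // Return the entries of 'config' whose keys start with "<prefix>.",
    // with the prefix stripped off each key.
    static Map<String, String> subset(Map<String, String> config, String prefix) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : config.entrySet()) {
            if (e.getKey().startsWith(prefix + ".")) {
                result.put(e.getKey().substring(prefix.length() + 1), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The flat configuration BLVP ends up with (values from the thread).
        Map<String, String> blvp = new LinkedHashMap<>();
        blvp.put("loader.vertexIdProperty", "bulkloader.vertex.id");
        blvp.put("loader.keepOriginalIds", "false");
        blvp.put("intermediateBatchSize", "10000");
        blvp.put("graph.gremlin.graph", "com.thinkaurelius.titan.core.TitanFactory");
        blvp.put("graph.storage.backend", "cassandra");
        blvp.put("graph.storage.hostname", "127.0.0.1");

        // The "graph" subset is what would be handed to GraphFactory.open()
        System.out.println(subset(blvp, "graph"));
        // The "loader" subset is what the BulkLoader implementation receives
        System.out.println(subset(blvp, "loader"));
    }
}
```

Note that stripping the "graph." prefix leaves an ordinary Graph configuration (gremlin.graph=..., storage.backend=..., and so on), which is exactly why GraphFactory.open() can consume it directly.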
