Under the hood, the last example generates this configuration:

*# from hadoop-script.properties:*
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=grateful-dead.txt
gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
gremlin.hadoop.outputLocation=output
spark.master=local[4]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
*# from the builder's fluent methods:*
loader.vertexIdProperty="bulkloader.vertex.id"
loader.keepOriginalIds=false
loader.intermediateBatchSize=10000

*# from the writeGraph method:*
graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
graph.storage.backend=cassandra
graph.storage.hostname=127.0.0.1
graph.storage.batch-loading=true

What BulkLoaderVertexProgram ultimately gets from the builder is:

loader.vertexIdProperty="bulkloader.vertex.id"
loader.keepOriginalIds=false
loader.intermediateBatchSize=10000
graph.gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
graph.storage.backend=cassandra
graph.storage.hostname=127.0.0.1
graph.storage.batch-loading=true

The loader subset is then passed to the BulkLoader implementation; the
graph subset is passed to GraphFactory.open().

Cheers,
Daniel

On Mon, Aug 31, 2015 at 8:55 PM, Marko Rodriguez <[email protected]> wrote:

> Hi Daniel,
>
> That looks really good. Just for closure, can you show me what your
> properties file looks like after it gets "unrolled" by BLVP? Even if you
> fat-finger in a few lines for example.
>
> Thanks,
> Marko.
>
> http://markorodriguez.com
>
> On Aug 31, 2015, at 12:39 PM, Daniel Kuppitz <[email protected]> wrote:
>
> > Okay, most of the confusion came from how I implemented the
> > configuration stuff.
> > I tweaked it a bit and here's what we have now compared to my initial
> > post:
> >
> > *hadoop-script.properties* got a lot slimmer:
> >
> > gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> > gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> > gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
> > gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> > gremlin.hadoop.jarsInDistributedCache=true
> > gremlin.hadoop.inputLocation=grateful-dead.txt
> > gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> > gremlin.hadoop.outputLocation=output
> >
> > spark.master=local[4]
> > spark.executor.memory=1g
> > spark.serializer=org.apache.spark.serializer.KryoSerializer
> >
> > The builder now provides fluent methods to configure the
> > BulkLoaderVertexProgram:
> >
> > blgr = GraphFactory.open("hadoop-script.properties")
> > blvp = BulkLoaderVertexProgram.build().
> >     vertexIdProperty("bulkloader.vertex.id").
> >     keepOriginalIds(false).
> >     writeGraph("titan-cassandra-bulk.properties").
> >     intermediateBatchSize(10000).create(blgr)
> > blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >
> > *titan-cassandra-bulk.properties* looks just like any other Graph
> > configuration file you're already familiar with:
> >
> > gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> >
> > storage.backend=cassandra
> > storage.hostname=127.0.0.1
> > storage.batch-loading=true
> >
> > And one last note regarding the question *"why do we need 2
> > configurations? gremlin.hadoop.graphOutputFormat and
> > gremlin.bulkLoaderVertexProgram.graph.class"*:
> >
> > We don't. graphOutputFormat can't be used by the BLVP, since we don't
> > only write, but also read elements from the target graph.
> > Since a VertexProgram doesn't know anything about the OutputFormat, we
> > wouldn't be able to get an instance of that graph and thus we wouldn't
> > be able to read from it. Consequently, the graphOutputFormat can/should
> > always be NullOutputFormat for BLVP (any other output format would have
> > the same effect); instead, the writeGraph configuration (as shown in my
> > last sample) always has to be provided.
> >
> > Cheers,
> > Daniel
> >
> > On Mon, Aug 31, 2015 at 4:34 PM, Marko Rodriguez <[email protected]> wrote:
> >
> >> Hi Daniel,
> >>
> >> This is great that we now have bulk loading via TP3 GraphComputer.
> >> However, before this gets merged, it is important that you get your
> >> configuration model consistent with the pattern we are using with other
> >> VertexPrograms. Problems:
> >>
> >> 1. gremlin.bulkLoaderVertexProgram.graph.storage.backend
> >>    - What does this have to do with Gremlin? This is a Titan thing.
> >>      We cannot mix this. Titan is not TinkerPop.
> >>    - Titan needs its own namespace. Talk to the Titan guys and see
> >>      what they are using for namespace conventions too, as if you are
> >>      committing to Titan's code base, you should follow their pattern
> >>      (and not make up your own).
> >> 2. gremlin.hadoop.graphOutputFormat
> >>    - This is where you specify your output, not in two places,
> >>      e.g. gremlin.bulkLoaderVertexProgram.graph.class.
> >> 3. Please look at how you are doing your fluent building of
> >>    BulkLoaderVertexProgram. Do not use Object[] key values.
> >>    - Study the pre-existing vertex programs and follow their pattern.
> >>
> >> In general, do your best to follow existing patterns. If everyone has
> >> different naming conventions, fluent APIs, etc., that will make TP3
> >> feel disjoint. Study what has already been created (configurations,
> >> fluent APIs, etc.) and use that model.
> >>
> >> Thanks Daniel,
> >> Marko.
> >>
> >> http://markorodriguez.com
> >>
> >> On Aug 31, 2015, at 8:06 AM, Daniel Kuppitz <[email protected]> wrote:
> >>
> >>> Hello TinkerPop devs,
> >>>
> >>> Over the last couple of days we've implemented a
> >>> BulkLoaderVertexProgram for TinkerPop3. I know a lot of people are
> >>> waiting for it, and I guess - once it's released - it won't take very
> >>> long until Stephen and I continue the Powers of Ten
> >>> <http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/> blog post
> >>> series.
> >>>
> >>> The TinkerPop3 BulkLoaderVertexProgram comes with an
> >>> IncrementalBulkLoader implementation that is used by default. However,
> >>> it's easy to use your own customized implementation of a bulk loader.
> >>> The vertex program supports all the input formats you're already
> >>> familiar with (GraphSON, Kryo, Script). As a target graph you can use
> >>> any graph that supports multiple concurrent connections (unfortunately
> >>> that restriction disqualifies Neo4j, as its current TP3 implementation
> >>> does not support the HA mode). Let me walk you through a simple
> >>> example that loads the Grateful Dead graph into Titan.
> >>>
> >>> *Prerequisites*
> >>>
> >>> - TinkerPop3 (development branch: blvp)
> >>> - Titan 0.9 (customized build)
> >>> - a running Hadoop (pseudo) cluster
> >>> - Cassandra 2.1.x (for this particular example, as I'm going to use
> >>>   Titan/Cassandra)
> >>>
> >>> *Build TinkerPop3 from source*
> >>>
> >>> git clone https://github.com/apache/incubator-tinkerpop.git
> >>> cd incubator-tinkerpop
> >>> git checkout blvp
> >>> mvn clean install -DskipTests
> >>>
> >>> *Build Titan from source*
> >>>
> >>> git clone https://github.com/thinkaurelius/titan.git
> >>> cd titan
> >>> sed 's@<tinkerpop.version>.*</tinkerpop.version>@<tinkerpop.version>3.0.1-SNAPSHOT</tinkerpop.version>@' pom.xml > pom.xml.new
> >>> mv pom.xml.new pom.xml
> >>> mvn clean install -DskipTests
> >>>
> >>> *Copy the Grateful Dead files to HDFS*
> >>>
> >>> cd incubator-tinkerpop
> >>> find . -name script-input-grateful-dead.groovy | head -n1 | xargs -I {} hadoop fs -copyFromLocal {} script-input-grateful-dead.groovy
> >>> find . -name grateful-dead.txt | head -n1 | xargs -I {} hadoop fs -copyFromLocal {} grateful-dead.txt
> >>>
> >>> *Create 2 configuration files - one for Titan/Cassandra, one for
> >>> Hadoop / the BulkLoader*
> >>>
> >>> *titan-cassandra.properties*
> >>>
> >>> gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
> >>>
> >>> storage.backend=cassandrathrift
> >>> storage.hostname=127.0.0.1
> >>>
> >>> *hadoop-script.properties*
> >>>
> >>> gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
> >>> gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
> >>> gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
> >>> gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
> >>> gremlin.hadoop.jarsInDistributedCache=true
> >>> gremlin.hadoop.inputLocation=grateful-dead.txt
> >>> gremlin.hadoop.scriptInputFormat.script=script-input-grateful-dead.groovy
> >>> gremlin.hadoop.outputLocation=output
> >>>
> >>> # Bulk Loader configuration
> >>> gremlin.bulkLoaderVertexProgram.loader.class=org.apache.tinkerpop.gremlin.process.computer.bulkloading.IncrementalBulkLoader
> >>> gremlin.bulkLoaderVertexProgram.loader.vertexIdProperty=bulkloader.vertex.id
> >>> gremlin.bulkLoaderVertexProgram.loader.userSuppliedIds=false
> >>> gremlin.bulkLoaderVertexProgram.loader.keepOriginalIds=false
> >>> gremlin.bulkLoaderVertexProgram.graph.class=com.thinkaurelius.titan.core.TitanFactory
> >>> gremlin.bulkLoaderVertexProgram.graph.storage.backend=cassandrathrift
> >>> gremlin.bulkLoaderVertexProgram.graph.storage.hostname=127.0.0.1
> >>> gremlin.bulkLoaderVertexProgram.graph.storage.batch-loading=true
> >>> gremlin.bulkLoaderVertexProgram.intermediateBatchSize=10000
> >>>
> >>> spark.master=local[4]
> >>> spark.executor.memory=1g
> >>> spark.serializer=org.apache.spark.serializer.KryoSerializer
> >>>
> >>> *Create the Titan schema*
> >>>
> >>> cd titan
> >>> bin/gremlin.sh
> >>>
> >>> graph = GraphFactory.open("titan-cassandra.properties")
> >>> m = graph.openManagement()
> >>> // vertex labels
> >>> artist = m.makeVertexLabel("artist").make()
> >>> song = m.makeVertexLabel("song").make()
> >>> // edge labels
> >>> sungBy = m.makeEdgeLabel("sungBy").make()
> >>> writtenBy = m.makeEdgeLabel("writtenBy").make()
> >>> followedBy = m.makeEdgeLabel("followedBy").make()
> >>> // vertex and edge properties
> >>> blid = m.makePropertyKey("bulkloader.vertex.id").dataType(Long.class).make()
> >>> name = m.makePropertyKey("name").dataType(String.class).make()
> >>> songType = m.makePropertyKey("songType").dataType(String.class).make()
> >>> performances = m.makePropertyKey("performances").dataType(Integer.class).make()
> >>> weight = m.makePropertyKey("weight").dataType(Integer.class).make()
> >>> // global indices
> >>> m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
> >>> m.buildIndex("artistsByName", Vertex.class).addKey(name).indexOnly(artist).buildCompositeIndex()
> >>> m.buildIndex("songsByName", Vertex.class).addKey(name).indexOnly(song).buildCompositeIndex()
> >>> // vertex-centric indices
> >>> m.buildEdgeIndex(followedBy, "followedByTime", Direction.BOTH, Order.decr, weight)
> >>> m.commit()
> >>> graph.close()
> >>>
> >>> Up to this point it's the usual stuff that we do all day long: create
> >>> configurations, create schemas, mess with Hadoop... All in all,
> >>> nothing special.
> >>>
> >>> *Here comes the new part - start the BulkLoaderVertexProgram*
> >>>
> >>> blgr = GraphFactory.open("hadoop-script.properties")
> >>> blvp = BulkLoaderVertexProgram.build().create(blgr)
> >>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>
> >>> Note that you don't have to have the Bulk Loader configuration
> >>> embedded in your Hadoop graph configuration file; you can also do:
> >>>
> >>> blgr = GraphFactory.open("hadoop-script.properties")
> >>> blvp = BulkLoaderVertexProgram.build().configure(
> >>>     // default values not included
> >>>     "loader.vertexIdProperty", "bulkloader.vertex.id",
> >>>     "loader.keepOriginalIds", false,
> >>>     "graph.class", "com.thinkaurelius.titan.core.TitanFactory",
> >>>     "graph.storage.backend", "cassandrathrift",
> >>>     "graph.storage.hostname", "127.0.0.1",
> >>>     "graph.storage.batch-loading", true,
> >>>     "intermediateBatchSize", 10000
> >>> ).create(blgr)
> >>> blgr.compute(SparkGraphComputer).program(blvp).submit().get()
> >>>
> >>> ...or simply mix both approaches.
> >>>
> >>> Play around with it and let us know what you think. If we get enough
> >>> positive feedback / no negative feedback, the BulkLoaderVertexProgram
> >>> will make it into the next TinkerPop release (3.0.1) and thus also
> >>> into the next Titan release.
> >>>
> >>> Cheers,
> >>> Daniel
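The loader/graph prefix splitting Daniel describes at the top of the thread (loader.* goes to the BulkLoader, graph.* with the prefix stripped goes to GraphFactory.open()) can be sketched in plain Java. This is a toy illustration of the idea only, not BLVP's actual code; the `subset` helper and class name are invented for the example:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigSubset {

    // Return all entries whose key starts with "<prefix>.", with the prefix stripped.
    static Map<String, String> subset(Map<String, String> config, String prefix) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : config.entrySet()) {
            if (e.getKey().startsWith(prefix + ".")) {
                result.put(e.getKey().substring(prefix.length() + 1), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> blvpConfig = new HashMap<>();
        blvpConfig.put("loader.vertexIdProperty", "bulkloader.vertex.id");
        blvpConfig.put("loader.keepOriginalIds", "false");
        blvpConfig.put("graph.gremlin.graph", "com.thinkaurelius.titan.core.TitanFactory");
        blvpConfig.put("graph.storage.backend", "cassandra");

        // loader.* is handed to the BulkLoader implementation,
        // graph.* (prefix stripped) to GraphFactory.open()
        Map<String, String> loader = subset(blvpConfig, "loader");
        Map<String, String> graph = subset(blvpConfig, "graph");

        System.out.println(loader.get("vertexIdProperty")); // bulkloader.vertex.id
        System.out.println(graph.get("storage.backend"));   // cassandra
    }
}
```

The same prefix-subset pattern is what Apache Commons Configuration's `subset()` provides, which is why flat properties files can drive several independent components from one namespace.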
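The reason BLVP must read from the target graph (and why graphOutputFormat stays NullOutputFormat) is that incremental loading is, in essence, an upsert keyed on the bulkloader.vertex.id property: look the vertex up by its original id, and create it only if it does not exist yet. A minimal sketch of that idea, with a plain HashMap standing in for the target graph and all names invented for illustration (this is not IncrementalBulkLoader's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class IncrementalUpsert {

    // Toy stand-in for the target graph: maps bulkloader.vertex.id -> vertex properties.
    final Map<Object, Map<String, Object>> targetGraph = new HashMap<>();

    // Look the vertex up by its original id; create it only if it is absent.
    // This read-before-write step is why BLVP needs a readable target graph
    // and cannot work through a write-only OutputFormat.
    Map<String, Object> getOrCreateVertex(Object originalId) {
        return targetGraph.computeIfAbsent(originalId, id -> {
            Map<String, Object> v = new HashMap<>();
            v.put("bulkloader.vertex.id", id); // persisted so later runs can find it
            return v;
        });
    }

    public static void main(String[] args) {
        IncrementalUpsert loader = new IncrementalUpsert();
        Map<String, Object> v1 = loader.getOrCreateVertex(42L);
        v1.put("name", "Garcia");
        // A second pass with the same source id finds the existing
        // vertex instead of creating a duplicate.
        Map<String, Object> v2 = loader.getOrCreateVertex(42L);
        System.out.println(v1 == v2);                  // true
        System.out.println(loader.targetGraph.size()); // 1
    }
}
```

In the real setup, the lookup is served by the "byBulkLoaderVertexId" composite index created in the schema step above, which is what keeps repeated or resumed bulk loads idempotent.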
