Hi,

I have been working with Dan LaRocque (Titan) today to get SparkGraphComputer 
working over Titan. We ran into a "huh…" moment when I realized that 
Spark/GiraphGraphComputer are super specific to FileOutputFormats. While Dan 
was able to get reading into Titan via TitanInputFormat working easily enough, 
writing was not so bueno. As it stood, Spark/GiraphGraphComputer assumed 
that reading/writing always went through HDFS, and thus particular persistence 
options didn't make sense -- e.g. Persist.ORIGINAL.

To rectify the situation, I did the following:

        1. HadoopGraph is simply a shell around a Configuration (it has always 
been this way). HadoopGraph holds no data itself; the data is pulled in at 
execution time via the InputFormat. So while no code changed here, 
conceptually, vendors should think of HadoopGraph as their access point to 
TP3's Hadoop features.
        public GraphComputer compute(Class<? extends GraphComputer> graphComputerClass) {
          // graphComputerClass is a Class object, so compare with equals(),
          // not instanceof
          if (graphComputerClass.equals(SparkGraphComputer.class)) {
            return new SparkGraphComputer(new HadoopGraph(this.configuration()));
          } else if (graphComputerClass.equals(GiraphGraphComputer.class)) {
            return new GiraphGraphComputer(new HadoopGraph(this.configuration()));
          } else ...
        }
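As a self-contained sketch, the class-literal dispatch above looks like the 
following (the types here are local stand-ins, not the real TP3 classes; note 
that a Class object is compared with equals()/isAssignableFrom(), never with 
instanceof):

```java
// Mock of the compute(Class) dispatch pattern -- stand-in types only.
public class ComputeDispatch {

    interface GraphComputer {}

    static class SparkGraphComputer implements GraphComputer {}
    static class GiraphGraphComputer implements GraphComputer {}

    // Dispatch on the class literal the caller passes in.
    static GraphComputer compute(final Class<? extends GraphComputer> graphComputerClass) {
        if (graphComputerClass.equals(SparkGraphComputer.class))
            return new SparkGraphComputer();
        else if (graphComputerClass.equals(GiraphGraphComputer.class))
            return new GiraphGraphComputer();
        else
            throw new IllegalArgumentException("Unsupported computer: " + graphComputerClass);
    }

    public static void main(final String[] args) {
        // Caller asks for a computer by class literal, e.g. compute(SparkGraphComputer.class)
        System.out.println(compute(SparkGraphComputer.class).getClass().getSimpleName());
    }
}
```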

        2. I created a PersistResultGraphAware interface which has a single 
method:
                public boolean supportsResultGraphPersistCombination(
                    final GraphComputer.ResultGraph resultGraph,
                    final GraphComputer.Persist persist);
             If you are a vendor and you have your own OutputFormat, make sure 
it implements this interface. This way, the persistence and result-graph 
requirements are known to Spark/GiraphGraphComputer. For the standard 
file-based OutputFormats provided by TP3 HadoopGraph, the method body is:
        @Override
        public boolean supportsResultGraphPersistCombination(final GraphComputer.ResultGraph resultGraph, final GraphComputer.Persist persist) {
          return persist.equals(GraphComputer.Persist.NOTHING) || resultGraph.equals(GraphComputer.ResultGraph.NEW);
        }
             Why this body? Because for file-based OutputFormats you cannot 
update the original data file, as files in HDFS are not randomly accessible. 
Thus, only graph clones can be created.
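To make the rule above concrete, here is a self-contained sketch of the 
file-based check (the enums are local stand-ins mirroring 
GraphComputer.ResultGraph/Persist, not the real TP3 types):

```java
// Sketch of the file-based supportsResultGraphPersistCombination() rule.
public class PersistCheck {

    // Local stand-ins for GraphComputer.ResultGraph and GraphComputer.Persist.
    enum ResultGraph { ORIGINAL, NEW }
    enum Persist { NOTHING, VERTEX_PROPERTIES, EDGES }

    // HDFS files are write-once: the original graph file can never be mutated
    // in place, so either persist nothing or write the result to a NEW graph.
    static boolean fileBasedSupports(final ResultGraph resultGraph, final Persist persist) {
        return persist.equals(Persist.NOTHING) || resultGraph.equals(ResultGraph.NEW);
    }

    public static void main(final String[] args) {
        // ORIGINAL + EDGES would mean mutating the source file in place: rejected.
        System.out.println(fileBasedSupports(ResultGraph.ORIGINAL, Persist.EDGES));
        // Writing edges to a clone (NEW) is fine.
        System.out.println(fileBasedSupports(ResultGraph.NEW, Persist.EDGES));
        // Persisting nothing is always safe.
        System.out.println(fileBasedSupports(ResultGraph.ORIGINAL, Persist.NOTHING));
    }
}
```

A random-access store (e.g. a hypothetical TitanOutputFormat) could return 
true for more combinations, since it can mutate the original graph in place.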

Dan LaRocque is working on a TitanOutputFormat as we speak, so we will see what 
other concepts may need to be tweaked.

*** LaRocque: Want to know something crazy -- we can do this "new 
HadoopGraph(titanConfiguration).traversal().V()" and OLTP linear scan the data 
out of Titan via the TitanInputFormat :) Classy. :D

This is really nice for other vendors like Neo4j, OrientDB, etc. Just create an 
Input/OutputFormat and you are automagically able to do OLAP graph operations 
via Hadoop/Spark/Giraph.

Enjoy,
Marko.

http://markorodriguez.com
