Hi,
I have been working with Dan LaRocque (Titan) today to get SparkGraphComputer
working over Titan. We ran into a "huh…" moment when I realized that
SparkGraphComputer and GiraphGraphComputer are tied specifically to
FileOutputFormats. While Dan was able to get reading from Titan via
TitanInputFormat working easily enough, writing was not so bueno. As it stood,
Spark/GiraphGraphComputer assumed that reading/writing always went through HDFS,
and thus particular persistence options didn't make sense -- e.g.
ResultGraph.ORIGINAL.
To rectify the situation, I did the following:
1. HadoopGraph is simply a shell around a Configuration (this has always been
the case). There is no real data held by HadoopGraph, as the data is pulled in
at execution time via the InputFormat. As such, while no code changed,
conceptually vendors should think of HadoopGraph as their access point to TP3's
Hadoop features.
public GraphComputer compute(final Class<? extends GraphComputer> graphComputerClass) {
    if (SparkGraphComputer.class.isAssignableFrom(graphComputerClass)) {
        return new SparkGraphComputer(new HadoopGraph(this.configuration()));
    } else if (GiraphGraphComputer.class.isAssignableFrom(graphComputerClass)) {
        return new GiraphGraphComputer(new HadoopGraph(this.configuration()));
    } else ...
}
2. I created a PersistResultGraphAware interface which has a single method:

public boolean supportsResultGraphPersistCombination(final GraphComputer.ResultGraph resultGraph,
                                                     final GraphComputer.Persist persist);
If you are a vendor and you have your own OutputFormat, make sure
it implements this interface. This way, persistence and data requirements are
known by Spark/GiraphGraphComputer. For the standard file-based OutputFormats
provided by TP3 HadoopGraph, the method body is:
@Override
public boolean supportsResultGraphPersistCombination(final GraphComputer.ResultGraph resultGraph,
                                                     final GraphComputer.Persist persist) {
    return persist.equals(GraphComputer.Persist.NOTHING) ||
           resultGraph.equals(GraphComputer.ResultGraph.NEW);
}
Why this body? Because for file-based OutputFormats you cannot update the
original data file, since files in HDFS are not randomly accessible. Thus,
only graph clones can be created.
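To make the combination rule concrete, here is a minimal, self-contained Java
sketch of the file-based rule. The enums and class name are local stand-ins for
GraphComputer.Persist and GraphComputer.ResultGraph (not the actual TinkerPop
types), so the example compiles without TinkerPop on the classpath:

```java
// Sketch of the file-based OutputFormat rule: either persist nothing,
// or persist into a brand-new graph clone (never mutate the original).
public class FilePersistRule {
    enum Persist { NOTHING, VERTEX_PROPERTIES, EDGES }
    enum ResultGraph { ORIGINAL, NEW }

    // Mirrors the standard file-based method body above.
    static boolean supports(ResultGraph resultGraph, Persist persist) {
        return persist == Persist.NOTHING || resultGraph == ResultGraph.NEW;
    }

    public static void main(String[] args) {
        // Print the full legality matrix.
        for (ResultGraph rg : ResultGraph.values())
            for (Persist p : Persist.values())
                System.out.println(rg + "/" + p + " -> " + supports(rg, p));
    }
}
```

Running it shows that every NEW combination is legal, while ORIGINAL is legal
only with NOTHING -- exactly the "clones only" constraint of HDFS files.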
Dan LaRocque is working on a TitanOutputFormat as we speak so we will see what
other concepts may need to be tweaked.
*** LaRocque: Want to know something crazy -- we can do this "new
HadoopGraph(titanConfiguration).traversal().V()" and OLTP linear scan the data
out of Titan via the TitanInputFormat :) Classy. :D
This is really nice for other vendors like Neo4j, OrientDB, etc. Just create an
Input/OutputFormat and you are automagically able to do OLAP graph operations
via Hadoop/Spark/Giraph.
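Since a database-backed store is randomly accessible, its OutputFormat can
report a more permissive matrix than the file-based ones. A hypothetical
sketch of what a vendor might write -- the interface and enums are declared
locally as stand-ins for the TinkerPop types, and RandomAccessOutputFormat is
a made-up name, not an actual Titan/Neo4j/OrientDB class:

```java
// A vendor OutputFormat that writes to a random-access store can mutate the
// source graph in place, so ORIGINAL + EDGES is legal too.
public class VendorPersistSketch {
    enum Persist { NOTHING, VERTEX_PROPERTIES, EDGES }
    enum ResultGraph { ORIGINAL, NEW }

    // Local stand-in for the PersistResultGraphAware interface.
    interface PersistResultGraphAware {
        boolean supportsResultGraphPersistCombination(ResultGraph resultGraph, Persist persist);
    }

    static class RandomAccessOutputFormat implements PersistResultGraphAware {
        @Override
        public boolean supportsResultGraphPersistCombination(ResultGraph resultGraph, Persist persist) {
            return true; // every ResultGraph/Persist combination is writable
        }
    }

    public static void main(String[] args) {
        PersistResultGraphAware format = new RandomAccessOutputFormat();
        System.out.println(format.supportsResultGraphPersistCombination(
            ResultGraph.ORIGINAL, Persist.EDGES));
    }
}
```

Spark/GiraphGraphComputer can then interrogate the OutputFormat up front and
fail fast when a requested ResultGraph/Persist pair is unsupported.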
Enjoy,
Marko.
http://markorodriguez.com