[Accumulo Contrib Proposal] Graphulo: Server-side Matrix Math library

Dylan Hutchison Thu, 27 Aug 2015 23:48:29 -0700

Dear Accumulo community,

I humbly ask your consideration of Graphulo
<https://github.com/Accla/graphulo> as a new contrib project to Accumulo.
Let's use this thread to discuss what Graphulo is, how it fits into the
Accumulo community, where we can take it together as a new community, and
how you can use it right now.  Please see the README at Graphulo's Github,
and for a more in-depth look see the docs/ folder or the examples.


https://github.com/Accla/graphulo

Graphulo is a Java library for the Apache Accumulo database delivering
server-side sparse matrix math primitives that enable higher-level graph
algorithms and analytics.

Pitch: Organizations use Accumulo for high-performance, indexed, distributed
data storage.  What do they do after their data is stored?  Many use cases
run analytics and algorithms on data in Accumulo.  Aside from simple uses of
iterators, these require scanning data out of Accumulo to a computation
engine, only to write the computation results back to Accumulo.  Graphulo
enables a class of algorithms to run inside the Accumulo server like a
stored procedure, especially (but not restricted to) those written in the
language of graphs and linear algebra.  Take breadth-first search as a
simple use case and PageRank as a more complex one.  As a stretch goal,
imagine analysts and mathematicians executing PageRank and other high-level
algorithms on top of the Graphulo library on top of Accumulo at high
performance.
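
To make "the language of graphs and linear algebra" concrete, here is a
minimal sketch of breadth-first search written as repeated sparse
matrix-vector products.  It is an illustration only: it runs in client
memory, and the class and method names (BfsSketch, addEdge, step, bfs) are
hypothetical and are not part of Graphulo's API.  Each step multiplies the
current frontier vector by the adjacency matrix under Boolean arithmetic
and masks out vertices already visited.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: none of these names come from Graphulo,
// and everything here runs in client memory rather than inside Accumulo.
final class BfsSketch {

    // Record one nonzero entry A[from][to] of a sparse adjacency matrix,
    // stored as row -> set of columns (zero entries are simply absent).
    static void addEdge(Map<String, Set<String>> adj, String from, String to) {
        adj.computeIfAbsent(from, key -> new TreeSet<>()).add(to);
    }

    // One BFS step: the product of the frontier vector with the adjacency
    // matrix under Boolean (OR, AND) arithmetic, masked by visited vertices.
    static Set<String> step(Map<String, Set<String>> adj,
                            Set<String> frontier,
                            Set<String> visited) {
        Set<String> next = new TreeSet<>();
        for (String v : frontier) {
            for (String w : adj.getOrDefault(v, Collections.<String>emptySet())) {
                if (!visited.contains(w)) {
                    next.add(w);
                }
            }
        }
        return next;
    }

    // Vertices first reached exactly k hops away from start.
    static Set<String> bfs(Map<String, Set<String>> adj, String start, int k) {
        Set<String> visited = new HashSet<>();
        visited.add(start);
        Set<String> frontier = new TreeSet<>();
        frontier.add(start);
        for (int i = 0; i < k && !frontier.isEmpty(); i++) {
            frontier = step(adj, frontier, visited);
            visited.addAll(frontier);
        }
        return frontier;
    }
}
```

In a server-side setting, the loop body would presumably run where the
table data lives instead of pulling rows out to the client, which is the
round trip the pitch describes avoiding.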

I have developed Graphulo at the MIT Lincoln Laboratory with support from
the NSF since last March.  I owe thanks to Jeremy Kepner, Vijay Gadepally,
and Adam Fuchs for high-level comments during the design and performance
testing phases.  Last spring I also proposed a design document (now
obsolete) to the Accumulo community, and it received good feedback.

The time is ripe for Graphulo to graduate from my personal development to
larger participation.  Beyond myself and beyond the Lincoln Laboratory,
Graphulo is for the Accumulo community.  Users need a place where they can
interact, developers need a place where they can look, comment, and debate
designs and diffs, and both users and developers need a place where they
can interact and see Graphulo alongside its Accumulo base.

The following outlines a few reasons why I see contrib to Accumulo as
Graphulo's proper place:

   1. Establishing Graphulo as an Apache (sub-)project is a first step
   toward building a community.  The spirit of Apache--its mailing list
   discussions, low barrier to interactions between users and developers new
   and old, open meritocracy and more--is a surefire way to bring Graphulo to
   the people it will help and the people who want to help it in turn.

   2. Committing to core Accumulo doesn't seem appropriate for all of
   Graphulo, because Graphulo uses Accumulo in a specific way (server-side
   computation) in support of algorithms and applications.  Parts of Graphulo
   that are useful for all Accumulo users (not just matrix math for
   algorithms) could be transferred from Graphulo to Accumulo, such as
   ApplyIterator or SmallLargeRowFilter or DynamicIterator.

   3. Leaving Graphulo as an external project leaves Graphulo too decoupled
   from Accumulo.  Graphulo has potential to drive features in core Accumulo
   such as ACCUMULO-3978, ACCUMULO-3710
   <https://issues.apache.org/jira/browse/ACCUMULO-3710>, and ACCUMULO-3751
   <https://issues.apache.org/jira/browse/ACCUMULO-3751>.  By making
   Graphulo a contrib sub-project, the two can maintain a tight relationship
   while still maintaining independent versions.

Historically, contrib projects have gone into Accumulo contrib and become
stale.  I assure you I do not intend this fate for Graphulo.  The Lincoln
Laboratory has interests in Graphulo, and I will continue developing
Graphulo at the very least to help transition it to greater community
involvement.  However, since I will start a PhD program at UW next month, I
cannot make Graphulo a full-time job as I have recently.  I do
have ideas for using Graphulo as part of my PhD database research.

Thus, if community support is large, I can transition to a support role
while others in the community step up.  If community support is smaller, I
can continue working on Graphulo as before at my own pace and perhaps more
publicly.  There are only a few steps left before Graphulo could be
considered "finished software":

   - Developing a new interface to Graphulo's core functions using
   immutable argument objects, which simplifies developer APIs, increases
   generalizability, and facilitates features like asynchronous and parallel
   operations.  It would be good if other developers weigh in on designs as
   we propose them, since this decides how Graphulo users interact with
   Graphulo.

   - Instrumenting Graphulo for monitoring, profiling and benchmarking.  I
   have a blueprint for using HTrace to make these tasks as easy as
   browsing a web page.  This needs careful thought and discussion before
   implementing, since the instrumentation will go everywhere.  It would be
   nice if Graphulo and Accumulo mirrored instrumentation strategies, so it
   would be good to have that discussion in the same venue.

   - Rigorous *scale testing*.  Good instrumentation is key.  With
   successful scale testing, we can show potential adopters clearly which
   operations Graphulo excels at, ultimately plotting where Graphulo stands
   in the world of big data software.

   - Explicitly supporting the GraphBLAS <http://graphblas.org/> spec, once
   it is agreed upon.  Graphulo was designed from the ground up with the
   GraphBLAS in mind, so this should be an easy task.  Aligning with this
   upcoming industry standard bodes well for ease of developing Graphulo
   algorithms.
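
As a rough illustration of the "immutable argument objects" idea in the
first item above, the sketch below bundles the inputs of one operation into
a single value that cannot change after construction.  The class name
MultiplyArgs and its fields (leftTable, rightTable, resultTable) are
hypothetical, chosen only for this example; they do not describe Graphulo's
actual or planned interface.

```java
import java.util.Objects;

// Hypothetical illustration only: MultiplyArgs and its fields are made up
// for this sketch and are not part of Graphulo.
final class MultiplyArgs {
    private final String leftTable;
    private final String rightTable;
    private final String resultTable;

    private MultiplyArgs(String leftTable, String rightTable, String resultTable) {
        this.leftTable = Objects.requireNonNull(leftTable, "leftTable");
        this.rightTable = Objects.requireNonNull(rightTable, "rightTable");
        this.resultTable = Objects.requireNonNull(resultTable, "resultTable");
    }

    // Factory method: all fields are fixed at construction time.
    static MultiplyArgs of(String leftTable, String rightTable, String resultTable) {
        return new MultiplyArgs(leftTable, rightTable, resultTable);
    }

    // "Modification" returns a new object and leaves this one untouched.
    MultiplyArgs withResultTable(String newResultTable) {
        return new MultiplyArgs(leftTable, rightTable, newResultTable);
    }

    String leftTable() { return leftTable; }
    String rightTable() { return rightTable; }
    String resultTable() { return resultTable; }
}
```

Because an instance never changes, the same arguments can be handed to
several threads or queued for later execution without defensive copying,
which is one way such a design could facilitate asynchronous and parallel
operations.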

Developing more algorithms and applications will follow too, and I imagine
this as an excellent place where newcomer users can get involved.

Other areas where Graphulo needs work include creating a proper release
framework (the release here <https://github.com/Accla/graphulo/releases>
could use improvement, starting with signed artifacts) and reviewing the
way Graphulo runs tests (currently centered around a critical file called
TEST_CONFIG.java, which is great for one developer, whereas a config file
would probably work better).  Both of these are places where more
experienced developers could help.  I should
also mention that Graphulo has groundwork in place for operations between
Accumulo instances, but I doubt many users would need that level of control.

Regarding IP, I'm happy to donate my commits, which cover 99% of the
Graphulo code base, to the ASF.  I'm sure other issues will arise and we can
sort them out.  Sean Busbey, perhaps I could ask your assistance as someone
more knowledgeable in this area.  Regarding dependencies, effectively every
direct dependency is org.apache, so nothing to worry about here.

I acknowledge that I will lose dictatorial power and gain some bureaucratic
/ discussion overhead by moving from sole developer to an Apache model.
The benefits of a community are well worth it.

If we as a community decide that contrib is the right place for Graphulo,
then there are many logistical questions to settle, such as where the code
will live, where JIRA will live, what mailing lists to use, what URL to
give Graphulo within apache.org, etc.  We can tackle these at our leisure.
Let's discuss Graphulo and Accumulo here first.

Warmly,
Dylan Hutchison
