Dear Accumulo community,
I humbly ask your consideration of Graphulo
<https://github.com/Accla/graphulo> as a new contrib project to
Accumulo. Let's use this thread to discuss what Graphulo is, how it
fits into the Accumulo community, where we can take it together as a new
community, and how you can use it right now. Please see the README at
Graphulo's GitHub, and for a more in-depth look see the docs/ folder or
the examples.
Graphulo is a Java library for the Apache Accumulo database
delivering server-side sparse matrix math primitives that enable
higher-level graph algorithms and analytics.
Pitch: Organizations use Accumulo for high performance indexed and
distributed data storage. What do they do after their data is stored?
Many use cases perform analytics and algorithms on data in Accumulo.
Aside from simple iterator uses, these require scanning data out of
Accumulo to a computation engine, only to write computation results back
to Accumulo. Graphulo enables a class of algorithms to run inside the
Accumulo server like a stored procedure, especially (but not restricted
to) those written in the language of graphs and linear algebra. Take
breadth first search as a simple use case and PageRank as a more
complex one. As a stretch goal, imagine analysts and mathematicians
executing PageRank and other high level algorithms on top of the
Graphulo library on top of Accumulo at high performance.
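To make "graph algorithms in the language of linear algebra" concrete, here is a minimal in-memory sketch of breadth first search written as repeated sparse matrix-vector products. It is an illustration only: the class BfsSketch and its methods are hypothetical names, not Graphulo's API, and it runs on plain Java maps in one process rather than server-side in Accumulo. The adjacency matrix is stored as row -> (column -> value), loosely mirroring how a table keyed by row and column qualifier could hold a graph.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration only; not part of Graphulo's API. */
class BfsSketch {

    /**
     * One BFS step: next = A^T * frontier over the usual (+, *) arithmetic,
     * dropping entries for vertices that were already visited.
     */
    static Map<String, Double> step(Map<String, Map<String, Double>> adj,
                                    Map<String, Double> frontier,
                                    Set<String> visited) {
        Map<String, Double> next = new TreeMap<>();
        for (Map.Entry<String, Double> f : frontier.entrySet()) {
            Map<String, Double> row = adj.get(f.getKey());
            if (row == null) {
                continue; // vertex has no out-edges
            }
            for (Map.Entry<String, Double> e : row.entrySet()) {
                if (visited.contains(e.getKey())) {
                    continue;
                }
                // Multiply frontier value by edge weight, sum over shared targets.
                next.merge(e.getKey(), f.getValue() * e.getValue(), Double::sum);
            }
        }
        return next;
    }

    /** Returns every vertex reachable from start within maxHops steps. */
    static Set<String> bfs(Map<String, Map<String, Double>> adj,
                           String start, int maxHops) {
        Set<String> visited = new TreeSet<>();
        visited.add(start);
        Map<String, Double> frontier = new TreeMap<>();
        frontier.put(start, 1.0);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            frontier = step(adj, frontier, visited);
            visited.addAll(frontier.keySet());
        }
        return visited;
    }

    /** Adds a directed edge with weight 1.0. */
    static void edge(Map<String, Map<String, Double>> adj, String from, String to) {
        adj.computeIfAbsent(from, k -> new TreeMap<>()).put(to, 1.0);
    }

    /** Small example graph: a->b, a->c, b->d, c->d, d->e. */
    static Map<String, Map<String, Double>> exampleGraph() {
        Map<String, Map<String, Double>> adj = new TreeMap<>();
        edge(adj, "a", "b");
        edge(adj, "a", "c");
        edge(adj, "b", "d");
        edge(adj, "c", "d");
        edge(adj, "d", "e");
        return adj;
    }

    public static void main(String[] args) {
        System.out.println(bfs(exampleGraph(), "a", 2)); // prints [a, b, c, d]
    }
}
```

The point of the sketch is that each hop is a single multiply-and-sum pass over the rows of the frontier vertices, which is the kind of work an iterator can do next to the data instead of shipping it to a client.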
I have developed Graphulo at the MIT Lincoln Laboratory with support
from the NSF since last March. I owe thanks to Jeremy Kepner, Vijay
Gadepally, and Adam Fuchs for high level comments during design and
performance testing phases. I also proposed a design document, now
obsolete, to the Accumulo community last spring, which received good
feedback. The time is ripe for Graphulo to graduate from my personal
development into
larger participation. Beyond myself and beyond the Lincoln Laboratory,
Graphulo is for the Accumulo community. Users need a place where they
can interact, developers need a place where they can look, comment, and
debate designs and diffs, and both users and developers need a place
where they can interact and see Graphulo alongside its Accumulo base.
The following outlines a few reasons why I see contrib to Accumulo as
Graphulo's proper place:
1. Establishing Graphulo as an Apache (sub-)project is a first step
toward building a community. The spirit of Apache--its mailing list
discussions, low barrier to interactions between users and
developers new and old, open meritocracy and more--is a surefire way
to bring Graphulo to the people it will help and the people who want
to help it in turn.
2. Committing to core Accumulo doesn't seem appropriate for all of
Graphulo, because Graphulo uses Accumulo in a specific way
(server-side computation) in support of algorithms and
applications. Parts of Graphulo that are useful for all Accumulo
users (not just matrix math for algorithms) could be transferred
from Graphulo to Accumulo, such as ApplyIterator or
SmallLargeRowFilter or DynamicIterator.
3. Leaving Graphulo as an external project leaves Graphulo too
decoupled from Accumulo. Graphulo has potential to drive features
in core Accumulo such as ACCUMULO-3978
<https://issues.apache.org/jira/browse/ACCUMULO-3978>,
ACCUMULO-3710 <https://issues.apache.org/jira/browse/ACCUMULO-3710>,
and ACCUMULO-3751
<https://issues.apache.org/jira/browse/ACCUMULO-3751>. By making
Graphulo a contrib sub-project, the two can maintain a tight
relationship while still maintaining independent versions.
Historically, contrib projects have gone into Accumulo contrib and
become stale. I assure you I do not intend this fate for Graphulo. The
Lincoln Laboratory has interests in Graphulo, and I will continue
developing Graphulo at the very least to help transition Graphulo to
greater community involvement. However, since I will start a PhD
program at UW next month, I cannot make Graphulo a full time job as I
have in recent history. I do have ideas for using Graphulo as part of
my PhD database research.
Thus, in the case of large community support, I can transition to a
support role while others in the community step up. If community
support is smaller, I can continue working on Graphulo as before at my
own pace and perhaps more publicly. There are only a few steps left
before Graphulo could be considered "finished software":
* Developing a new interface to Graphulo's core functions using
immutable argument objects, which simplifies developer APIs,
increases generalizability, and facilitates features like
asynchronous and parallel operations. It would be good for other
developers to weigh in on designs as we propose them, since
this decides how Graphulo users interact with Graphulo.
* Instrumenting Graphulo for monitoring, profiling and benchmarking.
I have a blueprint on how to use HTrace to make these tasks as easy
as browsing a web page. This needs careful thought and discussion before
implementing, since this instrumentation will go everywhere. It
would be nice if Graphulo and Accumulo mirror instrumentation
strategies, so it would be good to have that discussion in the same
venue.
* Rigorous *scale testing*. Good instrumentation is key. With
successful scale testing, we can show potential adopters which
operations Graphulo excels at, ultimately
plotting where Graphulo stands in the world of big data software.
* Explicitly supporting the GraphBLAS <http://graphblas.org/> spec,
once it is agreed upon. Graphulo was designed from the ground up
with the GraphBLAS in mind, so this should be an easy task.
Aligning with this upcoming industry standard bodes well for ease of
developing Graphulo algorithms.
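As a starting point for the design discussion on the first item above, an immutable argument object might look roughly like the sketch below. Everything here is hypothetical: the class name TableMultArgs, its fields, and the default of one thread are invented for illustration and do not reflect Graphulo's current or planned API. The idea is only that each "with" method returns a new object, so a set of arguments can be shared safely between asynchronous or parallel calls.

```java
import java.util.Objects;

/** Hypothetical sketch; names and fields are illustrative, not Graphulo's API. */
final class TableMultArgs {
    private final String leftTable;
    private final String rightTable;
    private final String resultTable;
    private final int numThreads;

    private TableMultArgs(String leftTable, String rightTable,
                          String resultTable, int numThreads) {
        this.leftTable = Objects.requireNonNull(leftTable, "leftTable");
        this.rightTable = Objects.requireNonNull(rightTable, "rightTable");
        this.resultTable = Objects.requireNonNull(resultTable, "resultTable");
        this.numThreads = numThreads;
    }

    /** Creates arguments with an assumed default of a single thread. */
    static TableMultArgs of(String leftTable, String rightTable, String resultTable) {
        return new TableMultArgs(leftTable, rightTable, resultTable, 1);
    }

    /** Returns a copy with a different thread count; the receiver is unchanged. */
    TableMultArgs withNumThreads(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("numThreads must be at least 1");
        }
        return new TableMultArgs(leftTable, rightTable, resultTable, n);
    }

    String leftTable() { return leftTable; }
    String rightTable() { return rightTable; }
    String resultTable() { return resultTable; }
    int numThreads() { return numThreads; }
}
```

A caller would write something like TableMultArgs.of("A", "B", "C").withNumThreads(4) and pass the result to an operation; because the object never changes after construction, the same instance can be reused across calls without defensive copying.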
Developing more algorithms and applications will follow too, and I
imagine this as an excellent place where newcomer users can get involved.
Other areas where Graphulo needs work include creating a
proper release framework (the release here
<https://github.com/Accla/graphulo/releases> could use improvement,
starting with signed artifacts) and reviewing the way Graphulo runs
tests (currently centered around a critical file called TEST_CONFIG.java
which is great for one developer, whereas a config file probably works
better). Both of these are places more experienced developers could
help. I should also mention that Graphulo has groundwork in place for
operations between Accumulo instances, but I doubt many users would need
that level of control.
Regarding IP, I'm happy to donate my commits to the ASF, which covers
99% of the Graphulo code base. I'm sure other issues will arise and we
can sort them out. Sean Busbey, perhaps I could ask your assistance as
someone more knowledgeable in this area. Regarding
dependencies, effectively every direct dependency is org.apache, so
nothing to worry about here.
I acknowledge that I will lose dictatorial power and gain some
bureaucratic / discussion overhead by moving from sole developer to an
Apache model. The benefits of a community are well worth it.
If we as a community decide that contrib is the right place for
Graphulo, then there are lots of logistical questions to decide like
where the code will live, where JIRA will live, what mailing lists to
use, what URL to give Graphulo within apache.org <http://apache.org>,
etc. We can tackle these at our leisure. Let's discuss Graphulo and
Accumulo here first.
Warmly,
Dylan Hutchison