Re: [DISCUSSION] Accumulo, another BigTable clone, has shown up on Apache Incubator as a proposal

Duane Moore Tue, 06 Sep 2011 09:22:37 -0700

Hello all,

I've been a lurker on the HBase list for a year or so and our company has
also been working with the Accumulo implementation during the same time
frame.  I'd like to respond to Stack's suggestion to focus on the
technical merits of the proposal.  Since I have some info on the
pre-open-source version of Accumulo, I'd like to share some of our
evaluation of the software, primarily from a client perspective (vs.
implementation details like logging to NFS vs. HDFS).


First, I share many of the same concerns of folks who were frustrated that
this project seems to duplicate the effort of the open source
(particularly HBase) community.  However, I will second what Todd and Joey
said and reiterate that contributing to open source is not easy for a
government contractor, and especially not easy for U.S. government
employees.  My personal preference for a long while has been to migrate
our Accumulo implementation to HBase, but as with any project there are
often non-technical considerations for doing so.

Below are some notes we took last year on the differences between Accumulo
and HBase, with additional notes from me inline.  Much of this mirrors
what is in the current Accumulo proposal.

-----

- Column Families
In HBase you must specify all column families up front as part of the
table schema declaration when creating a table.
Accumulo does not have this restriction: you do not declare column
families when you create a table. When you insert a new row into the
table, you can simply provide a new column family.
** Note: from what Stack said, it sounds like this is close to being OBE
(overtaken by events)?


- Aggregation
Accumulo offers the ability to specify an aggregator for an individual
column family or column. This allows you to keep a row count, or summation
of numerical values that may be stored in a particular column. It would
appear the function has to operate on the subset of values stored for that
column in the table at a particular time since it keeps the aggregate
value in memory. So this may not be able to handle certain aggregation
functions like 'median' for instance. But functions like sum, max, min,
mean, and count should all be supportable.
I could not find a comparable feature within HBase, but HBase does offer
an atomic method called incrementColumnValue on the HTable class, which it
appears could be leveraged to provide aggregation behavior.
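
To make the aggregation point concrete, here is a rough plain-Java sketch
(no Accumulo or HBase classes involved; the RunningAggregate class and its
method names are made up for illustration). It assumes, as the notes above
guess, that the aggregator only keeps a small running value in memory. Under
that assumption sum, min, max, count, and mean all fall out naturally, while
median does not, because it would need every individual value.

```java
// Hypothetical illustration only; this is not Accumulo's aggregator API.
final class RunningAggregate {
    private long count = 0;
    private long sum = 0;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    // Fold one new cell value into the running state.
    // The value itself is not retained afterwards.
    void add(long value) {
        count++;
        sum += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    long count() { return count; }
    long sum()   { return sum; }
    long min()   { return min; }
    long max()   { return max; }

    // Mean is derivable from two running values (sum and count).
    double mean() { return count == 0 ? 0.0 : (double) sum / count; }

    // Deliberately no median(): it cannot be computed from the fields
    // above without keeping all of the values around.
}

final class AggregatorSketch {
    public static void main(String[] args) {
        RunningAggregate agg = new RunningAggregate();
        for (long v : new long[] {4, 1, 7}) {
            agg.add(v);
        }
        System.out.println("count=" + agg.count() + " sum=" + agg.sum()
                + " min=" + agg.min() + " max=" + agg.max()
                + " mean=" + agg.mean());
    }
}
```

An atomic increment, like the one mentioned above for HBase, would cover the
count and sum cases in the same spirit; min and max would need a different
mechanism.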


- Column Visibility
This is the feature in Accumulo that allows tagging of the data at the
column level, which would primarily be used for classification markings
(in our scenario).
If we were to implement the same type of column visibility in HBase that
Accumulo supports, we would potentially have several options:
  - Try to implement column visibility as a patch to HBase. Would be fun,
    but may be a lot of work.
  - Since the value of a particular column (cell, actually) is simply a
    byte array, we could utilize a standard technique of encoding the
    visibility level/classification in the column value itself.
  - Since the number of columns is not pre-defined, adopt a convention
    whereby each column "foo" gets an additional column added by our
    infrastructure called "foo_visibility".
** Note: We have a requirement to use PKI (digital certificates) for
authentication in our service stack. The relationship between PKI and
Kerberos currently used for Secure HBase is interesting; not quite sure
how the two would fit together in practice.
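
For the second and third options above, a minimal plain-Java sketch of what
the client-side conventions could look like. Everything here is an
assumption for illustration: the VisibilitySketch class, the length-prefixed
byte layout, and modeling a row as a Map are all made up, and nothing in it
is HBase or Accumulo API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a row is modeled as a plain column -> bytes map.
final class VisibilitySketch {
    // Option 2: pack [2-byte label length][label bytes][payload] into one cell value.
    static byte[] encode(String label, byte[] payload) {
        byte[] l = label.getBytes(StandardCharsets.UTF_8);
        if (l.length > Short.MAX_VALUE) {
            throw new IllegalArgumentException("label too long");
        }
        return ByteBuffer.allocate(2 + l.length + payload.length)
                .putShort((short) l.length)
                .put(l)
                .put(payload)
                .array();
    }

    // Read the visibility label back out of an encoded cell value.
    static String labelOf(byte[] cell) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        byte[] l = new byte[buf.getShort()];
        buf.get(l);
        return new String(l, StandardCharsets.UTF_8);
    }

    // Read the original payload back out of an encoded cell value.
    static byte[] payloadOf(byte[] cell) {
        int labelLen = ByteBuffer.wrap(cell).getShort();
        return Arrays.copyOfRange(cell, 2 + labelLen, cell.length);
    }

    // Option 3: keep the payload untouched and write the label to a
    // companion column named "<column>_visibility".
    static void putWithCompanion(Map<String, byte[]> row, String column,
                                 String label, byte[] payload) {
        row.put(column, payload);
        row.put(column + "_visibility", label.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] cell = encode("SECRET", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(labelOf(cell) + " / "
                + new String(payloadOf(cell), StandardCharsets.UTF_8));

        Map<String, byte[]> row = new HashMap<>();
        putWithCompanion(row, "foo", "SECRET", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(row.keySet());
    }
}
```

In this sketch the label is only data; any check against it happens in
client code after the cell has been read, so something in our service stack
would still have to enforce it.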

- Retrieving Data
Accumulo uses a Scanner object for all retrieval operations, which is
obtained from the Connector object. When retrieving all values for a
particular row, _each individual cell is returned as a separate entry_ by
the Scanner iterator.
In HBase, you can use a Scan object (org.apache.hadoop.hbase.client.Scan)
or you can use a Get object, which allows you to retrieve a single row at
a time. In either case, the org.apache.hadoop.hbase.client.Result class is
returned, representing all of the requested data for that particular row.
In HBase, to set constraints on a query, you set an
org.apache.hadoop.hbase.filter.Filter object on the Scan object. Multiple
Filters may be set by using the FilterList object. In Accumulo, you call
the setScanIterators() method on the Scanner object, which enables the
appropriate iterators for use on the server before returning data.
** Note: the primary difference here is the use of server-side iterators,
which Andy has correctly pointed out could be implemented via the
coprocessor framework.  We did some initial investigation into
coprocessors to see if we could implement equivalent functionality, but
since we'd been directed to use Accumulo, we didn't have much bandwidth to
pursue it (coprocessors were also in their infancy at the time).
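
To illustrate the difference in result shape described above, here is a
small plain-Java sketch. It does not use the Accumulo or HBase client
libraries; the Cell and RetrievalSketch classes and the scanCells/scanRows
method names are hypothetical. The Predicate stands in for a filter or
iterator that is applied before data is returned to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Hypothetical model of one stored cell; not a class from either project.
final class Cell {
    final String row;
    final String column;
    final String value;

    Cell(String row, String column, String value) {
        this.row = row;
        this.column = column;
        this.value = value;
    }
}

final class RetrievalSketch {
    // Accumulo-like shape: every matching cell comes back as its own entry.
    static List<Cell> scanCells(List<Cell> table, Predicate<Cell> filter) {
        List<Cell> out = new ArrayList<>();
        for (Cell c : table) {
            if (filter.test(c)) {
                out.add(c);
            }
        }
        return out;
    }

    // HBase-like shape: matching cells are grouped so each row is one result.
    static Map<String, Map<String, String>> scanRows(List<Cell> table,
                                                     Predicate<Cell> filter) {
        Map<String, Map<String, String>> rows = new TreeMap<>();
        for (Cell c : table) {
            if (filter.test(c)) {
                rows.computeIfAbsent(c.row, r -> new TreeMap<>())
                    .put(c.column, c.value);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Cell> table = List.of(
                new Cell("r1", "a", "1"),
                new Cell("r1", "b", "2"),
                new Cell("r2", "a", "3"));
        Predicate<Cell> all = c -> true;
        System.out.println(scanCells(table, all).size() + " cell entries");
        System.out.println(scanRows(table, all).size() + " row results");
    }
}
```

The same three cells come back either as three entries or as two row-level
results, which is the main thing client code has to account for when moving
between the two styles.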



-----


Hope that helps.  Bottom line is that I believe that the features in
Accumulo can and ought to be merged into HBase at some point (assuming the
technical merits hold up).  Looking forward to contributing to that
conversation.

Thanks,
Duane

On 9/3/11 2:21 PM, "Stack" <[email protected]> wrote:

>
>I'd suggest we refocus this thread on how to respond to the Accumulo
>proposal (or whether to respond at all), since thats what we 'know'.
>I think it'd be useful correcting at least the 'unlikely tos' with
>pointers to committed code.
>
>Code overlap, if any, can be addressed when the code drop happens.
>
>St.Ack
>
