Hello all,

I've been a lurker on the HBase list for a year or so and our company has also been working with the Accumulo implementation during the same time frame. I'd like to respond to Stack's suggestion to focus on the technical merits of the proposal. Since I have some info on the pre-open sourced version of Accumulo, I'd like to share some of our evaluation of the software, primarily from a client perspective (vs. implementation details like logging to NFS vs HDFS).

First, I share many of the concerns of folks who were frustrated that this project seems to duplicate the effort of the open source (particularly HBase) community. However, I will second what Todd and Joey said and reiterate that contributing to open source is not easy for a government contractor, and especially not easy for U.S. government employees. My personal preference for a long while has been to migrate our Accumulo implementation to HBase, but as with any project there are often non-technical considerations in doing so.

Below are some notes we took last year on the differences between Accumulo and HBase, with additional notes from me inline. Much of this mirrors what is in the current Accumulo proposal.

-----

- Column Families

In HBase you must specify all column families up front as part of the table schema declaration when creating a table. Accumulo does not have this restriction: you do not declare column families when you create a table. When you insert a new row into the table you can just provide a new column family.

** Note: from what Stack said, it sounds like this is close to being OBE?

- Aggregation

Accumulo offers the ability to specify an aggregator for an individual column family or column. This allows you to keep a row count, or a summation of numerical values that may be stored in a particular column. It would appear the function has to operate on the subset of values stored for that column in the table at a particular time, since it keeps the aggregate value in memory. So this may not be able to handle certain aggregation functions like 'median', for instance. But functions like sum, max, min, mean, and count should all be supportable.

I could not find a comparable feature within HBase, but HBase does offer an atomic function called incrementColumnValue on the HTable class which, it appears, can be leveraged to provide aggregation behavior.

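To make the in-memory point concrete, here is a minimal sketch in plain Java of an aggregate that keeps only a running value. The class and method names are hypothetical and are not the Accumulo aggregator interface or any HBase API; it only illustrates why sum, max, min, count, and mean fit this model while median does not.

```java
import java.util.function.LongBinaryOperator;

/** Illustrative only: hypothetical names, not the Accumulo or HBase API. */
final class ColumnAggregateSketch {
    private final LongBinaryOperator combine;
    private long state;   // the single aggregate value held in memory
    private long seen;    // number of values folded in so far

    ColumnAggregateSketch(long initial, LongBinaryOperator combine) {
        this.state = initial;
        this.combine = combine;
    }

    /** Fold one more stored value into the running state. */
    void collect(long value) {
        state = combine.applyAsLong(state, value);
        seen++;
    }

    long aggregate() { return state; }

    long count() { return seen; }

    static ColumnAggregateSketch sum() {
        return new ColumnAggregateSketch(0L, Long::sum);
    }

    static ColumnAggregateSketch max() {
        return new ColumnAggregateSketch(Long.MIN_VALUE, Math::max);
    }

    static ColumnAggregateSketch min() {
        return new ColumnAggregateSketch(Long.MAX_VALUE, Math::min);
    }

    public static void main(String[] args) {
        ColumnAggregateSketch sum = sum();
        for (long v : new long[] {3L, 5L, 7L}) {
            sum.collect(v);
        }
        // Mean falls out of the running sum and count.
        System.out.println("sum=" + sum.aggregate()
                + " mean=" + (sum.aggregate() / (double) sum.count()));
    }
}
```

Each of these needs only constant state per column. A median would need every stored value (or at least a sorted summary of them) at once, which is why it does not fit a single running value.
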
- Column Visibility

This is the feature in Accumulo that allows tagging of the data at the column level, which would primarily be used for classification markings (in our scenario). If we were to implement the same type of column visibility in HBase that Accumulo supports, we would potentially have several options:

  - Try to implement column visibility as a patch to HBase. Would be fun, but may be a lot of work.
  - Since the value of a particular column (cell, actually) is simply a byte array, we could utilize a standard technique of encoding the visibility level/classification in the column value itself.
  - Since the number of columns is not pre-defined, adopt a convention whereby each column "foo" gets an additional column added by our infrastructure called "foo_visibility".

** Note: We have a requirement to use PKI (digital certificates) for authentication in our service stack. The relationship between PKI and Kerberos currently used for Secure HBase is interesting; not quite sure how the two would fit together in practice.

- Retrieving Data

Accumulo uses a Scanner object for all retrieval operations; you obtain a Scanner from the Connector object. When retrieving all values for a particular row, _each individual cell is returned as a new entry_ by the Scanner iterator.

In HBase, you can use a Scan object (org.apache.hadoop.hbase.client.Scan) or you can use a Get object, which allows you to retrieve a single row at a time. In either case, the org.apache.hadoop.hbase.client.Result class is returned, representing all of the requested data for that particular row.

In HBase, to set constraints on a query, you set an org.apache.hadoop.hbase.filter.Filter object on the Scan object. Multiple Filters may be set by using the FilterList object. In Accumulo, you call the setScanIterators() method on the Scanner object, which enables the appropriate iterators for use on the server before returning data.

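As a rough illustration of the per-cell versus per-row difference described above, here is a small plain-Java sketch that folds a stream of cell entries into one map per row. The types are hypothetical stand-ins, not the Accumulo Scanner entries or the HBase Result class; a client that wants row-shaped results from a per-cell scanner would do something along these lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical types, not the Accumulo or HBase client classes. */
final class CellsToRowsSketch {

    /** One cell, as a per-cell scanner would hand it back. */
    static final class Cell {
        final String row;
        final String column;
        final String value;

        Cell(String row, String column, String value) {
            this.row = row;
            this.column = column;
            this.value = value;
        }
    }

    /** Group a cell stream into one column-to-value map per row, keeping arrival order. */
    static Map<String, Map<String, String>> groupByRow(List<Cell> cells) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (Cell c : cells) {
            rows.computeIfAbsent(c.row, r -> new LinkedHashMap<>()).put(c.column, c.value);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("row1", "cf:a", "1"));
        cells.add(new Cell("row1", "cf:b", "2"));
        cells.add(new Cell("row2", "cf:a", "3"));
        // Three per-cell entries collapse into two row-shaped results.
        System.out.println(groupByRow(cells));
    }
}
```
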
** Note: the primary difference here is in the use of server-side iterators, which Andy has correctly pointed out could be implemented via the coprocessor framework. We did some initial investigation into coprocessors to see if we could implement equivalent functionality, but since we'd been directed to use Accumulo, we didn't have much bandwidth to address this (also, coprocessors were in their infancy at the time).

-----

Hope that helps. Bottom line is that I believe the features in Accumulo can and ought to be merged into HBase at some point (assuming the technical merits hold up). Looking forward to contributing to that conversation.

Thanks,
Duane

On 9/3/11 2:21 PM, "Stack" <[email protected]> wrote:

>
>I'd suggest we refocus this thread on how to respond to the Accumulo
>proposal (or whether to respond at all), since thats what we 'know'.
>I think it'd be useful correcting at least the 'unlikely tos' with
>pointers to committed code.
>
>Code overlap, if any, can be addressed when the code drop happens.
>
>St.Ack
>
