Just to keep this with the rest of the thread, as it might be useful to others using HBase.
I got a quick chance to test protobuf (http://code.google.com/p/protobuf/) on my MacBook Pro and found that it can serialize and deserialize a Collection of Objects, each with 9 Strings, at a rate of 74 serializations-plus-deserializations per millisecond. Deserialization alone ran at 146 per millisecond.

Thanks for pointing to protobuf, Jonathan - it seems well placed for use in HBase where complex types are needed (e.g. my many2one in a single row).

Tim

On Wed, Aug 19, 2009 at 10:14 PM, tim robertson<[email protected]> wrote:
> Thanks JG for taking the time to digest that and comment.
>
> It was a hastily written page as I am on vacation... I'm heading over
> to the bay area for a few days and wanted to start getting something
> together to discuss with Stack and anyone else free on Tuesday 8th. I
> hope it develops into a good case study for the HBase community.
>
> In terms of operations, I think it will boil down to 3 things:
> a) full scans building search indexes (probably Lucene)
>
> b) scanning and annotating the same row (such as a field for a named
> national park the point falls in)
> - in terms of the scientific identification there will be a fair
> amount of scanning the identification and then annotating a new column
> with an ID for the equivalent of an external taxonomy. One example
> might be that you want to browse the occurrence store (specimens and
> observations) using the Catalogue of Life taxonomy, and this record
> would be found with the identifier
> http://www.catalogueoflife.org/annual-checklist/2008/show_species_details.php?record_id=5204463.
> It is not as simple as a name match, as the synonymy is subject to
> differing opinion.
>
> c) scans with filters. I expect that most of these raw values will be
> parsed out to better-typed values (hash-encoded lat/long etc.) and the
> filters would be on those families and not these "raw" families.
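For reference, the protobuf test Tim describes above starts from a message definition compiled with protoc. A sketch of what such a definition might look like follows; the message and field names are my own illustrative assumptions, not Tim's actual wiki schema:

```proto
// Hypothetical proto2 message for one scientific identification
// (9 String fields, matching the shape of the benchmark above).
// All names here are assumptions for illustration.
message Identification {
  required string scientist = 1;
  required string scientific_name = 2;
  required string rank = 3;
  required string date_identified = 4;
  required string kingdom = 5;
  required string genus = 6;
  required string species = 7;
  required string taxonomy_id = 8;   // e.g. a Catalogue of Life record id
  required string remarks = 9;
}

// A single cell value can then hold the whole many2one list.
message IdentificationList {
  repeated Identification identifications = 1;
}
```

The generated Java classes provide `toByteArray()` and `parseFrom(byte[])`, which is what a benchmark like the one above would be timing.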
> I think the identifications would be parsed to IDs of well-known
> taxonomies, and the filters would use those values.
>
> I was expecting serializing to be the most likely choice, and I'll
> start checking out the protobuf stuff - I have been writing my own
> serializations based on Writable for storing values in Lucene indexes
> recently.
>
> I'll clean up the wiki and will probably have more questions.
>
> Cheers,
>
> Tim
>
>
> On Wed, Aug 19, 2009 at 7:20 PM, Jonathan Gray<[email protected]> wrote:
>> Tim,
>>
>> Very cool wiki page. Unfortunately I'm a little confused about exactly what
>> the requirements are.
>>
>> Does each species (and the combination of all of its identifications)
>> actually have a single, unique ID?
>>
>> The most important thing when designing your HBase schema is to understand
>> how you want to query it, and I'm not exactly sure I follow that part.
>>
>> I'm going to assume that there is a single, relatively static set of
>> attributes for each unique ID (the GUID, Cat#, etc.). Let's put that in a
>> family, call it "attributes". You would use that family as a key/value
>> dictionary: the qualifier would be the attribute name, and the value would
>> be the attribute value (e.g. attributes:InstCode with value MNHA).
>>
>> The row, in this case, would be the GUID or whatever unique ID you want to
>> look up by.
>>
>> Now the other part: storing the identifications. I would definitely vote
>> against multiple rows, multiple tables, and multiple families. As you
>> point out, multiple tables would require joining, multiple families does in
>> fact mean 2 separate files, and multiple rows add a great deal of
>> complexity (you need to Scan now, and cannot rely on Get).
>>
>> So let's say we have a family "identifications" (though you may want to
>> shorten these family names, as they are actually stored explicitly for every
>> single cell... maybe "ids"). For each identification, you would have a
>> single column.
>> The qualifier of that column would be whatever the unique
>> identifier is for that identification; or, if there isn't one, you could just
>> wrap the entire thing up into a serialized type and use that as the
>> qualifier. If you have an ID, then I would serialize the identification
>> into the value.
>>
>> You point out that this would have poor scanning performance because of the
>> need for deserialization, but I don't necessarily agree. That can be quite
>> fast, depending on the implementation, and there's a great deal of
>> serialization/deserialization being done behind the scenes to even get the
>> data to you in the first place.
>>
>> Something like protobuf has very efficient and fast serialize/deserialize
>> operations. Java serialization is inefficient in space and can be slow,
>> which is why HBase and Hadoop implement the Writable interface and provide a
>> minimal/efficient/binary serialization.
>>
>> I do think that is by far the best approach here; the
>> serialization/deserialization should be orders of magnitude faster than
>> round-trip network latency.
>>
>> I didn't realize your first bullet was what it was; I thought you were
>> talking about serializing the entire thing into one column. Looking again, it
>> seems you're on the right track, and that would be the simplest and fastest
>> approach.
>>
>> Keep us updated!
>>
>> JG
>>
>>
>> tim robertson wrote:
>>>
>>> Hi all,
>>>
>>> I have just started a project to research the migration of a
>>> biodiversity occurrence index (plant / animal specimens collected or
>>> observed) from MySQL to HBase.
>>>
>>> Our source records inherently have a many-to-one relationship. Think of
>>> "Scientist A identified this as a Felis concolor concolor" but 25
>>> years later "Scientist B identified the same preserved specimen as a
>>> Puma concolor". The scientific identification has more attributes,
>>> and there will always be 1 or more (could be 10s of them) for the same
>>> specimen.
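The Writable-style binary serialization JG mentions can be sketched with plain java.io streams, no Hadoop dependency needed. This is a minimal sketch, assuming a 9-String record layout (the actual field set would come from the wiki schema):

```java
import java.io.*;
import java.util.*;

// A minimal sketch of Writable-style binary serialization for a list of
// identifications, suitable as an HBase cell value. Uses only java.io;
// the 9-String record layout is an assumption for illustration.
public class IdentificationCodec {

    public static byte[] serialize(List<String[]> identifications) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(identifications.size());        // list length first
            for (String[] id : identifications) {
                out.writeInt(id.length);                 // then each record
                for (String field : id) out.writeUTF(field);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {                        // cannot occur on byte arrays
            throw new UncheckedIOException(e);
        }
    }

    public static List<String[]> deserialize(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int n = in.readInt();
            List<String[]> ids = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                String[] id = new String[in.readInt()];
                for (int j = 0; j < id.length; j++) id[j] = in.readUTF();
                ids.add(id);
            }
            return ids;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] id = {"Scientist A", "Felis concolor concolor", "species",
                       "1984", "Animalia", "Felis", "concolor", "5204463", ""};
        byte[] cell = serialize(Collections.singletonList(id));
        String[] back = deserialize(cell).get(0);
        System.out.println(Arrays.equals(id, back));     // true: lossless round trip
    }
}
```

Length-prefixing the list and each record keeps the format self-describing enough to evolve (e.g. adding a 10th field later), which is the same trick Hadoop's own Writables use.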
>>>
>>> I am pondering how to model this in HBase, seeing a few obvious options:
>>> - serialize the scientific identification "List" as bytes
>>> - expand the record into 2 or more rows, indicating the rows were
>>> derived from the same source
>>> - expand the identifications into new families
>>> - expand the identification fields into multiple fields in the same family
>>> - consider more than 1 table
>>>
>>> All of the above have pros and cons with respect to client code
>>> complexity and performance.
>>>
>>> I have put up a very simple example record at
>>> http://code.google.com/p/biodiversity/wiki/HBaseSchema and would
>>> welcome any comments on this list or on the wiki directly.
>>>
>>> Please note that I have only just started the project, so the
>>> documentation is really just starting up at this point, but this will
>>> be a case study of a migration from MySQL which might be of interest
>>> to others.
>>>
>>> Thanks,
>>>
>>> Tim
>>>
>>
>
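Pulling the thread's recommendation together: row key = GUID, an "attributes" family used as a key/value dictionary, and a short-named "ids" family with one column per identification (qualifier = identification ID, value = serialized bytes). The sketch below models that layout in memory with a sorted map rather than a live HBase table, so the HBase client calls (Put/Get) are elided and all names are assumptions:

```java
import java.util.*;

// An in-memory model of the suggested row layout: a TreeMap of
// "family:qualifier" -> value stands in for one HBase row. Real code
// would issue Put/Get against a table; this only illustrates the shape.
public class OccurrenceRow {

    private final String rowKey;                         // the GUID
    private final NavigableMap<String, byte[]> cells = new TreeMap<>();

    public OccurrenceRow(String guid) { this.rowKey = guid; }

    public String rowKey() { return rowKey; }

    // "attributes" family as a key/value dictionary, e.g. InstCode -> MNHA.
    public void putAttribute(String name, String value) {
        cells.put("attributes:" + name, value.getBytes());
    }

    // "ids" family: qualifier = identification ID, value = serialized bytes.
    public void putIdentification(String identificationId, byte[] serialized) {
        cells.put("ids:" + identificationId, serialized);
    }

    // All identifications in one read, like a Get restricted to the "ids"
    // family; ';' is the character after ':', bounding the prefix range.
    public SortedMap<String, byte[]> identifications() {
        return cells.subMap("ids:", true, "ids;", false);
    }

    public static void main(String[] args) {
        OccurrenceRow row = new OccurrenceRow("urn:example:specimen:123"); // assumed GUID
        row.putAttribute("InstCode", "MNHA");
        row.putIdentification("ident-1984-a", "Felis concolor concolor".getBytes());
        row.putIdentification("ident-2009-b", "Puma concolor".getBytes());
        System.out.println(row.identifications().keySet());
        // [ids:ident-1984-a, ids:ident-2009-b]
    }
}
```

Note how a single row fetch returns every identification for a specimen, which is exactly the property that made multiple rows (Scan instead of Get) unattractive in the discussion above.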
