Just to keep this with the rest of the thread, as it might be useful to others using HBase.
I got a quick chance to test protobuf (http://code.google.com/p/protobuf/) on my MacBook Pro and found that it can serialize and deserialize a Collection of Objects, each with 9 Strings, at a rate of 74 serializations-plus-deserializations per millisecond. Deserialization alone ran at 146 per millisecond.

Thanks for pointing to protobuf, Jonathan - it seems well placed for use in HBase where complex types are needed (e.g. my many2one in a single row).

Tim

On Wed, Aug 19, 2009 at 10:14 PM, tim robertson<[email protected]> wrote:
> Thanks JG for taking the time to digest that and comment.
>
> It was a hastily written page as I am on vacation... I'm heading over
> to the bay area for a few days and wanted to start getting something
> together to discuss with Stack and anyone else free on Tuesday 8th. I
> hope it develops into a good case study for the HBase community.
>
> In terms of operations, I think it will boil down to 3 things:
> a) full scans building search indexes (probably Lucene)
>
> b) scanning and annotating the same row (such as a field for a named
> national park the point falls in)
> - in terms of the scientific identification there will be a fair
> amount of scanning the identification and then annotating a new column
> with an ID for the equivalent of an external taxonomy. One example
> might be that you want to browse the occurrence store (specimens and
> observations) using the Catalogue of Life taxonomy, and this record
> would be found with the identifier
> http://www.catalogueoflife.org/annual-checklist/2008/show_species_details.php?record_id=5204463.
> It is not as simple as a name match, as the synonymy is subject to
> differing opinion.
>
> c) scans with filters. I expect that most of these raw values will be
> parsed out to better-typed values (hash-encoded lat/long etc.) and the
> filters would be on those families and not these "raw" families.
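For reference, the protobuf test Tim describes above starts from a message definition compiled with protoc. A sketch of what such a definition might look like follows; the message and field names are my own illustrative assumptions, not Tim's actual wiki schema:

```proto
// Hypothetical proto2 message for one scientific identification
// (9 String fields, matching the shape of the benchmark above).
// All names here are assumptions for illustration.
message Identification {
  required string scientist = 1;
  required string scientific_name = 2;
  required string rank = 3;
  required string date_identified = 4;
  required string kingdom = 5;
  required string genus = 6;
  required string species = 7;
  required string taxonomy_id = 8;   // e.g. a Catalogue of Life record id
  required string remarks = 9;
}

// A single cell value can then hold the whole many2one list.
message IdentificationList {
  repeated Identification identifications = 1;
}
```

The generated Java classes provide `toByteArray()` and `parseFrom(byte[])`, which is what a benchmark like the one above would be timing.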
> I think the identifications would be parsed to IDs of well-known
> taxonomies, and the filters would use those values.
>
> I was expecting serializing to be the most likely choice, and I'll
> start checking out the protobuf stuff - I have been writing my own
> serializations based on Writable for storing values in Lucene indexes
> recently.
>
> I'll clean up the wiki and will probably have more questions.
>
> Cheers,
>
> Tim
>
>
> On Wed, Aug 19, 2009 at 7:20 PM, Jonathan Gray<[email protected]> wrote:
>> Tim,
>>
>> Very cool wiki page. Unfortunately I'm a little confused about exactly what
>> the requirements are.
>>
>> Does each species (and the combination of all of its identifications)
>> actually have a single, unique ID?
>>
>> The most important thing when designing your HBase schema is to understand
>> how you want to query it, and I'm not exactly sure I follow that part.
>>
>> I'm going to assume that there is a single, relatively static set of
>> attributes for each unique ID (the GUID, Cat#, etc.). Let's put that in a
>> family, call it "attributes". You would use that family as a key/value
>> dictionary: the qualifier would be the attribute name, and the value would
>> be the attribute value (e.g. attributes:InstCode with value MNHA).
>>
>> The row, in this case, would be the GUID or whatever unique ID you want to
>> look up by.
>>
>> Now the other part: storing the identifications. I would definitely vote
>> against multiple rows, multiple tables, and multiple families. As you
>> point out, multiple tables would require joining, multiple families does in
>> fact mean 2 separate files, and multiple rows add a great deal of
>> complexity (you need to Scan now, and cannot rely on Get).
>>
>> So let's say we have a family "identifications" (though you may want to
>> shorten these family names, as they are actually stored explicitly for every
>> single cell... maybe "ids"). For each identification, you would have a
>> single column.
>> The qualifier of that column would be whatever the unique
>> identifier is for that identification; or, if there isn't one, you could just
>> wrap the entire thing up into a serialized type and use that as the
>> qualifier. If you have an ID, then I would serialize the identification
>> into the value.
>>
>> You point out that this would have poor scanning performance because of the
>> need for deserialization, but I don't necessarily agree. That can be quite
>> fast, depending on the implementation, and there's a great deal of
>> serialization/deserialization being done behind the scenes to even get the
>> data to you in the first place.
>>
>> Something like protobuf has very efficient and fast serialize/deserialize
>> operations. Java serialization is inefficient in space and can be slow,
>> which is why HBase and Hadoop implement the Writable interface and provide a
>> minimal/efficient/binary serialization.
>>
>> I do think that is by far the best approach here; the
>> serialization/deserialization should be orders of magnitude faster than
>> round-trip network latency.
>>
>> I didn't realize your first bullet was what it was; I thought you were
>> talking about serializing the entire thing into one column. Looking again, it
>> seems you're on the right track, and that would be the simplest and fastest
>> approach.
>>
>> Keep us updated!
>>
>> JG
>>
>>
>> tim robertson wrote:
>>>
>>> Hi all,
>>>
>>> I have just started a project to research the migration of a
>>> biodiversity occurrence index (plant / animal specimens collected or
>>> observed) from MySQL to HBase.
>>>
>>> Our source records inherently have a many-to-one relationship. Think of
>>> "Scientist A identified this as a Felis concolor concolor" but 25
>>> years later "Scientist B identified the same preserved specimen as a
>>> Puma concolor". The scientific identification has more attributes,
>>> and there will always be 1 or more (could be 10s of them) for the same
>>> specimen.
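The Writable-style binary serialization JG mentions can be sketched with plain java.io streams, no Hadoop dependency needed. This is a minimal sketch, assuming a 9-String record layout (the actual field set would come from the wiki schema):

```java
import java.io.*;
import java.util.*;

// A minimal sketch of Writable-style binary serialization for a list of
// identifications, suitable as an HBase cell value. Uses only java.io;
// the 9-String record layout is an assumption for illustration.
public class IdentificationCodec {

    public static byte[] serialize(List<String[]> identifications) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(identifications.size());        // list length first
            for (String[] id : identifications) {
                out.writeInt(id.length);                 // then each record
                for (String field : id) out.writeUTF(field);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {                        // cannot occur on byte arrays
            throw new UncheckedIOException(e);
        }
    }

    public static List<String[]> deserialize(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int n = in.readInt();
            List<String[]> ids = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                String[] id = new String[in.readInt()];
                for (int j = 0; j < id.length; j++) id[j] = in.readUTF();
                ids.add(id);
            }
            return ids;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] id = {"Scientist A", "Felis concolor concolor", "species",
                       "1984", "Animalia", "Felis", "concolor", "5204463", ""};
        byte[] cell = serialize(Collections.singletonList(id));
        String[] back = deserialize(cell).get(0);
        System.out.println(Arrays.equals(id, back));     // true: lossless round trip
    }
}
```

Length-prefixing the list and each record keeps the format self-describing enough to evolve (e.g. adding a 10th field later), which is the same trick Hadoop's own Writables use.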
>>>
>>> I am pondering how to model this in HBase, seeing a few obvious options:
>>> - serialize the scientific identification "List" as bytes
>>> - expand the record into 2 or more rows, indicating the rows were
>>> derived from the same source
>>> - expand the identifications into new families
>>> - expand the identification fields into multiple fields in the same family
>>> - consider more than 1 table
>>>
>>> All of the above have pros and cons with respect to client code
>>> complexity and performance.
>>>
>>> I have put up a very simple example record at
>>> http://code.google.com/p/biodiversity/wiki/HBaseSchema and would
>>> welcome any comments on this list or on the wiki directly.
>>>
>>> Please note that I have only just started the project, so the
>>> documentation is really just starting up at this point, but this will
>>> be a case study of a migration from MySQL which might be of interest
>>> to others.
>>>
>>> Thanks,
>>>
>>> Tim
>>>
>>
>
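Pulling the thread's recommendation together: row key = GUID, an "attributes" family used as a key/value dictionary, and a short-named "ids" family with one column per identification (qualifier = identification ID, value = serialized bytes). The sketch below models that layout in memory with a sorted map rather than a live HBase table, so the HBase client calls (Put/Get) are elided and all names are assumptions:

```java
import java.util.*;

// An in-memory model of the suggested row layout: a TreeMap of
// "family:qualifier" -> value stands in for one HBase row. Real code
// would issue Put/Get against a table; this only illustrates the shape.
public class OccurrenceRow {

    private final String rowKey;                         // the GUID
    private final NavigableMap<String, byte[]> cells = new TreeMap<>();

    public OccurrenceRow(String guid) { this.rowKey = guid; }

    public String rowKey() { return rowKey; }

    // "attributes" family as a key/value dictionary, e.g. InstCode -> MNHA.
    public void putAttribute(String name, String value) {
        cells.put("attributes:" + name, value.getBytes());
    }

    // "ids" family: qualifier = identification ID, value = serialized bytes.
    public void putIdentification(String identificationId, byte[] serialized) {
        cells.put("ids:" + identificationId, serialized);
    }

    // All identifications in one read, like a Get restricted to the "ids"
    // family; ';' is the character after ':', bounding the prefix range.
    public SortedMap<String, byte[]> identifications() {
        return cells.subMap("ids:", true, "ids;", false);
    }

    public static void main(String[] args) {
        OccurrenceRow row = new OccurrenceRow("urn:example:specimen:123"); // assumed GUID
        row.putAttribute("InstCode", "MNHA");
        row.putIdentification("ident-1984-a", "Felis concolor concolor".getBytes());
        row.putIdentification("ident-2009-b", "Puma concolor".getBytes());
        System.out.println(row.identifications().keySet());
        // [ids:ident-1984-a, ids:ident-2009-b]
    }
}
```

Note how a single row fetch returns every identification for a specimen, which is exactly the property that made multiple rows (Scan instead of Get) unattractive in the discussion above.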
