Tim,
Very cool wiki page. Unfortunately I'm a little confused about exactly what
the requirements are.
Does each species (and the combination of all of its identifications)
actually have a single, unique ID?
The most important thing when designing your HBase schema is to understand
how you want to query it. And I'm not exactly sure I follow that part.
I'm going to assume that there is a single, relatively static set of
attributes for each unique ID (the GUID, Cat#, etc.). Let's put that in a
family, call it "attributes". You would use that family as a key/value
dictionary. The qualifier would be the attribute name, and the value would
be the attribute value (e.g. attributes:InstCode with value MNHA).
The row, in this case, would be the GUID or whatever unique ID you want to
look up by.
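For example, a write with the plain Java client might look like this
(untested sketch; the table name "occurrence", the GUID value, and the
second attribute are made up for illustration):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  HTable table = new HTable(new HBaseConfiguration(), "occurrence");
  String guid = "urn:catalog:MNHA:123";  // whatever your unique ID is

  // Row key is the GUID; each attribute is one cell in "attributes",
  // with qualifier = attribute name and value = attribute value.
  Put put = new Put(Bytes.toBytes(guid));
  put.add(Bytes.toBytes("attributes"), Bytes.toBytes("InstCode"),
          Bytes.toBytes("MNHA"));
  put.add(Bytes.toBytes("attributes"), Bytes.toBytes("CatalogNumber"),
          Bytes.toBytes("12345"));
  table.put(put);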
Now the other part, storing the identifications. I would definitely vote
against multiple rows, multiple tables, and multiple families. As you
point out, multiple tables would require joining, multiple families do in
fact mean two separate files on disk, and multiple rows add a great deal of
complexity (you need a Scan now and cannot rely on a simple Get).
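To make the Get vs. Scan difference concrete (again untested, continuing
with the table/guid from above; the prefix-scan stop key is just for
illustration):

  import org.apache.hadoop.hbase.client.*;

  // One row per specimen: a single random read.
  Result row = table.get(new Get(Bytes.toBytes(guid)));

  // Multiple rows per specimen: you'd have to prefix the row keys
  // with the GUID, scan the range, and stitch results back together.
  // ("~" is just an example stop key that sorts after the separator.)
  Scan scan = new Scan(Bytes.toBytes(guid), Bytes.toBytes(guid + "~"));
  ResultScanner scanner = table.getScanner(scan);
  for (Result r : scanner) {
    // reassemble one logical record from several rows...
  }
  scanner.close();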
So let's say we have a family "identifications" (though you may want to
shorten these family names as they are actually stored explicitly for every
single cell... maybe "ids"). For each identification, you would have a
single column. The qualifier of that column would be whatever the unique
identifier is for that identification, or if there isn't one, you could just
wrap up the entire thing into a serialized type and use that as the
qualifier. If you have an ID, then I would serialize the identification
into the value.
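Concretely, writing a new identification and reading the full history
back might look like this (continuing with the table/guid from above;
Identification, identificationId, and the serialize/deserialize helpers
are placeholders for whatever you actually end up using):

  import java.util.Map;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  // One identification = one new cell in the "ids" family.
  Put put = new Put(Bytes.toBytes(guid));
  put.add(Bytes.toBytes("ids"), Bytes.toBytes(identificationId),
          serialize(identification));
  table.put(put);

  // A single Get returns every identification for the specimen.
  Get get = new Get(Bytes.toBytes(guid));
  get.addFamily(Bytes.toBytes("ids"));
  Result result = table.get(get);
  for (Map.Entry<byte[], byte[]> e :
       result.getFamilyMap(Bytes.toBytes("ids")).entrySet()) {
    Identification ident = deserialize(e.getValue());
    // e.getKey() is that identification's unique ID
  }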
You point out that this would have poor scanning performance because of the
need for deserialization, but I don't necessarily agree. That can be quite
fast, depending on implementation, and there's a great deal of
serialization/deserialization being done behind the scenes to even get the
data to you in the first place.
Something like protobufs has very efficient and fast serialize/deserialize
operations. Java serialization is inefficient in space and can be slow,
which is why HBase and Hadoop implement the Writable interface and provide a
minimal/efficient/binary serialization.
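If you wanted to roll your own Writable for this, a minimal sketch might
look like the following (the class and field names are just guesses, not
from your wiki; swap in your real attributes):

  import java.io.*;
  import org.apache.hadoop.io.Writable;

  public class Identification implements Writable {
    private String scientist;
    private String scientificName;
    private long dateIdentified;

    public void write(DataOutput out) throws IOException {
      out.writeUTF(scientist);
      out.writeUTF(scientificName);
      out.writeLong(dateIdentified);
    }

    public void readFields(DataInput in) throws IOException {
      scientist = in.readUTF();
      scientificName = in.readUTF();
      dateIdentified = in.readLong();
    }

    // To/from the byte[] stored as the cell value.
    public byte[] toBytes() throws IOException {
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      write(new DataOutputStream(baos));
      return baos.toByteArray();
    }

    public static Identification fromBytes(byte[] bytes) throws IOException {
      Identification ident = new Identification();
      ident.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return ident;
    }
  }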
I do think that is by far the best approach here; the
serialization/deserialization should be orders of magnitude faster than the
round-trip network latency.
I didn't realize your first bullet was what it was; I thought you were
talking about serializing the entire thing into one column. Looking again,
it seems you're on the right track, and that would be the simplest and
fastest approach.
Keep us updated!
JG
tim robertson wrote:
Hi all,
I have just started a project to research the migration of a
biodiversity occurrence index (plant / animal specimens collected or
observed) from MySQL to HBase.
We have source records that inherently have a many-to-one
relationship. Think of
"Scientist A identified this as a Felis concolor concolor" but 25
years later "Scientist B identified the same preserved specimen as a
Puma concolor". This scientific identification has more attributes
and there will always be 1 or more (could be 10s of them) for the same
specimen.
I am pondering how to model this in HBase, seeing a few obvious options:
- serializing the scientific identification "List" as bytes
- expanding the record into two or more rows, indicating that the rows
were derived from the same source
- expand the identifications into new families
- expand the identification fields into multiple fields in the same family
- consider more than 1 table
All of the above have pros and cons with respect to client code
complexity and performance.
I have put up a very simple example record at
http://code.google.com/p/biodiversity/wiki/HBaseSchema and would
welcome any comments on this list or on the wiki directly.
Please note that I have only just started the project, so the
documentation is really just starting up at this point, but this will
be a case study of a migration from MySQL which might be of interest
to others.
Thanks,
Tim