Tim,
Very cool wiki page. Unfortunately I'm a little confused about exactly what
the requirements are.
Does each species (and the combination of all of its identifications)
actually have a single, unique ID?
The most important thing when designing your HBase schema is to understand
how you want to query it. And I'm not exactly sure I follow that part.
I'm going to assume that there is a single, relatively static set of
attributes for each unique ID (the GUID, Cat#, etc.). Let's put that in a
family, call it "attributes". You would use that family as a key/value
dictionary. The qualifier would be the attribute name, and the value would
be the attribute value (e.g. attributes:InstCode with value MNHA).
The row, in this case, would be the GUID or whatever unique ID you want to
look up by.
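For example, a write with the plain Java client might look like this
(untested sketch; the table name "occurrence", the GUID value, and the
second attribute are made up for illustration):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  HTable table = new HTable(new HBaseConfiguration(), "occurrence");
  String guid = "urn:catalog:MNHA:123";  // whatever your unique ID is

  // Row key is the GUID; each attribute is one cell in "attributes",
  // with qualifier = attribute name and value = attribute value.
  Put put = new Put(Bytes.toBytes(guid));
  put.add(Bytes.toBytes("attributes"), Bytes.toBytes("InstCode"),
          Bytes.toBytes("MNHA"));
  put.add(Bytes.toBytes("attributes"), Bytes.toBytes("CatalogNumber"),
          Bytes.toBytes("12345"));
  table.put(put);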
Now the other part, storing the identifications. I would definitely vote
against multiple rows, multiple tables, and multiple families. As you
point out, multiple tables would require joining, multiple families do in
fact mean two separate files on disk, and multiple rows add a great deal of
complexity (you need a Scan now and cannot rely on a simple Get).
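To make the Get vs. Scan difference concrete (again untested, continuing
with the table/guid from above; the prefix-scan stop key is just for
illustration):

  import org.apache.hadoop.hbase.client.*;

  // One row per specimen: a single random read.
  Result row = table.get(new Get(Bytes.toBytes(guid)));

  // Multiple rows per specimen: you'd have to prefix the row keys
  // with the GUID, scan the range, and stitch results back together.
  // ("~" is just an example stop key that sorts after the separator.)
  Scan scan = new Scan(Bytes.toBytes(guid), Bytes.toBytes(guid + "~"));
  ResultScanner scanner = table.getScanner(scan);
  for (Result r : scanner) {
    // reassemble one logical record from several rows...
  }
  scanner.close();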
So let's say we have a family "identifications" (though you may want to
shorten these family names as they are actually stored explicitly for every
single cell... maybe "ids"). For each identification, you would have a
single column. The qualifier of that column would be whatever the unique
identifier is for that identification, or if there isn't one, you could just
wrap up the entire thing into a serialized type and use that as the
qualifier. If you have an ID, then I would serialize the identification
into the value.
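Concretely, writing a new identification and reading the full history
back might look like this (continuing with the table/guid from above;
Identification, identificationId, and the serialize/deserialize helpers
are placeholders for whatever you actually end up using):

  import java.util.Map;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  // One identification = one new cell in the "ids" family.
  Put put = new Put(Bytes.toBytes(guid));
  put.add(Bytes.toBytes("ids"), Bytes.toBytes(identificationId),
          serialize(identification));
  table.put(put);

  // A single Get returns every identification for the specimen.
  Get get = new Get(Bytes.toBytes(guid));
  get.addFamily(Bytes.toBytes("ids"));
  Result result = table.get(get);
  for (Map.Entry<byte[], byte[]> e :
       result.getFamilyMap(Bytes.toBytes("ids")).entrySet()) {
    Identification ident = deserialize(e.getValue());
    // e.getKey() is that identification's unique ID
  }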
You point out that this would have poor scanning performance because of the
need for deserialization, but I don't necessarily agree. That can be quite
fast, depending on implementation, and there's a great deal of
serialization/deserialization being done behind the scenes to even get the
data to you in the first place.
Something like protobufs has very efficient and fast serialize/deserialize
operations. Java serialization is inefficient in space and can be slow,
which is why HBase and Hadoop implement the Writable interface and provide a
minimal/efficient/binary serialization.
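If you wanted to roll your own Writable for this, a minimal sketch might
look like the following (the class and field names are just guesses, not
from your wiki; swap in your real attributes):

  import java.io.*;
  import org.apache.hadoop.io.Writable;

  public class Identification implements Writable {
    private String scientist;
    private String scientificName;
    private long dateIdentified;

    public void write(DataOutput out) throws IOException {
      out.writeUTF(scientist);
      out.writeUTF(scientificName);
      out.writeLong(dateIdentified);
    }

    public void readFields(DataInput in) throws IOException {
      scientist = in.readUTF();
      scientificName = in.readUTF();
      dateIdentified = in.readLong();
    }

    // To/from the byte[] stored as the cell value.
    public byte[] toBytes() throws IOException {
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      write(new DataOutputStream(baos));
      return baos.toByteArray();
    }

    public static Identification fromBytes(byte[] bytes) throws IOException {
      Identification ident = new Identification();
      ident.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return ident;
    }
  }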
I do think that is by far the best approach here; the
serialization/deserialization should be orders of magnitude faster than the
round-trip network latency.
I didn't realize your first bullet was what it was; I thought you were
talking about serializing the entire thing into one column. Looking again,
it seems you're on the right track, and that would be the simplest and
fastest approach.
Keep us updated!
JG
tim robertson wrote:
Hi all,
I have just started a project to research the migration of a
biodiversity occurrence index (plant / animal specimens collected or
observed) from MySQL to HBase.
We have source records that inherently have a many-to-one
relationship. Think of
"Scientist A identified this as a Felis concolor concolor" but 25
years later "Scientist B identified the same preserved specimen as a
Puma concolor". This scientific identification has more attributes
and there will always be 1 or more (could be 10s of them) for the same
specimen.
I am pondering how to model this in HBase, seeing a few obvious options:
- serializing the scientific identification "List" as bytes
- expanding the record into two or more rows, indicating that the rows
were derived from the same source
- expand the identifications into new families
- expand the identification fields into multiple fields in the same family
- consider more than 1 table
All of the above have pros and cons with respect to client code
complexity and performance.
I have put up a very simple example record at
http://code.google.com/p/biodiversity/wiki/HBaseSchema and would
welcome any comments on this list or on the wiki directly.
Please note that I have only just started the project, so the
documentation is really just starting up at this point, but this will
be a case study of a migration from MySQL which might be of interest
to others.
Thanks,
Tim