Hi all,

I have just started a project to research migrating a biodiversity
occurrence index (plant / animal specimens collected or observed)
from MySQL to HBase.

We have source records that inherently contain a many-to-one
relationship.  Think of "Scientist A identified this as a Felis
concolor concolor" but 25 years later "Scientist B identified the same
preserved specimen as a Puma concolor".  Each scientific
identification carries further attributes, and there will always be
one or more (possibly tens) of them for the same specimen.

I am pondering how to model this in HBase and see a few obvious options:
- serializing the scientific identification "List" as bytes
- expanding the record into 2 or more rows, marked as derived from
the same source
- expanding the identifications into new families
- expanding the identification fields into multiple fields in the same family
- using more than 1 table
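
To make two of these options concrete, here is a minimal sketch in
plain Python (no HBase client involved) that models a row as a map of
"family:qualifier" to bytes.  The family name "id", the zero-padded
index in the qualifier, the JSON encoding and the field names are all
assumptions made for illustration; none of them come from an existing
schema.

```python
import json

# Hypothetical specimen with two identifications, based on the example above.
identifications = [
    {"scientific_name": "Felis concolor concolor", "identified_by": "Scientist A"},
    {"scientific_name": "Puma concolor", "identified_by": "Scientist B"},
]


def as_serialized_cell(idents):
    """Option 1: the whole list is stored as one opaque cell value."""
    # JSON is only a stand-in here; any byte serialization would do.
    return {b"id:list": json.dumps(idents).encode("utf-8")}


def as_indexed_qualifiers(idents):
    """Option 4: every field becomes its own column in a single family."""
    cells = {}
    for index, ident in enumerate(idents):
        for field, value in ident.items():
            # Zero-padding keeps the columns of one identification adjacent
            # when qualifiers are sorted as bytes.
            qualifier = f"id:{index:04d}:{field}".encode("utf-8")
            cells[qualifier] = value.encode("utf-8")
    return cells


if __name__ == "__main__":
    for column, value in sorted(as_indexed_qualifiers(identifications).items()):
        print(column, value)
```

In this model the serialized form has to be read and rewritten as a
whole to add an identification, while the indexed form only needs new
columns, at the cost of the client reassembling the list.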

All of the above have pros and cons with respect to client code
complexity and performance.

I have put up a very simple example record on
http://code.google.com/p/biodiversity/wiki/HBaseSchema and would
welcome any comments on this list or on the wiki directly.

Please note that I have only just started the project, so the
documentation is still at a very early stage, but this will be a case
study of a migration from MySQL which might be of interest to others.

Thanks,

Tim
