Hi all,

I have just started a project to research the migration of a biodiversity occurrence index (plant / animal specimens collected or observed) from MySQL to HBase.
Our source records inherently contain a many-to-one relationship: many scientific identifications can refer to one specimen. Think of "Scientist A identified this as a Felis concolor concolor" but 25 years later "Scientist B identified the same preserved specimen as a Puma concolor". Each scientific identification carries several attributes of its own, and there will always be 1 or more (possibly 10s of them) for the same specimen.

I am pondering how to model this in HBase and see a few obvious options:

- serialize the scientific identification "List" as bytes
- expand the record into 2 or more rows, indicating that the rows were derived from the same source
- expand the identifications into new families
- expand the identification fields into multiple fields in the same family
- consider more than 1 table

All of the above have pros and cons with respect to client code complexity and performance.

I have put up a very simple example record on http://code.google.com/p/biodiversity/wiki/HBaseSchema and would welcome any comments on this list or on the wiki directly.

Please note that I have only just started the project, so the documentation is still at an early stage, but this will be a case study of a migration from MySQL which might be of interest to others.

Thanks,
Tim
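To make three of the options above a bit more concrete, here is a rough sketch in plain Python. It does not talk to HBase at all: it only models cells as a dict keyed by (row key, "family:qualifier") so the resulting layouts can be compared side by side. The row key "specimen-0001", the family name "id", the field names and the "#" separator are all made up for illustration and are not taken from the wiki page.

```python
import json

# Hypothetical specimen with the two identifications from the example above.
SPECIMEN_ID = "specimen-0001"  # made-up row key
IDENTIFICATIONS = [
    {"identified_by": "Scientist A", "scientific_name": "Felis concolor concolor"},
    {"identified_by": "Scientist B", "scientific_name": "Puma concolor"},
]


def as_serialized_list(specimen_id, identifications):
    """One row, one cell: the whole identification list serialized as bytes."""
    value = json.dumps(identifications).encode("utf-8")
    return {(specimen_id, "id:all"): value}


def as_indexed_qualifiers(specimen_id, identifications):
    """One row, one family: each field becomes its own indexed qualifier."""
    cells = {}
    for i, ident in enumerate(identifications):
        for field, value in ident.items():
            cells[(specimen_id, f"id:{i}.{field}")] = value.encode("utf-8")
    return cells


def as_multiple_rows(specimen_id, identifications):
    """One row per identification; a shared row-key prefix links them."""
    cells = {}
    for i, ident in enumerate(identifications):
        row_key = f"{specimen_id}#{i:04d}"  # zero-padded so rows sort in order
        for field, value in ident.items():
            cells[(row_key, f"id:{field}")] = value.encode("utf-8")
    return cells


if __name__ == "__main__":
    for build in (as_serialized_list, as_indexed_qualifiers, as_multiple_rows):
        print(build.__name__)
        cells = build(SPECIMEN_ID, IDENTIFICATIONS)
        for (row, column), value in sorted(cells.items()):
            print(f"  {row}  {column} = {value!r}")
```

As far as I can tell, the trade-off looks roughly like this: the serialized list keeps the schema trivial but the client has to rewrite the whole cell to add an identification; indexed qualifiers keep everything in one row at the cost of the client tracking the index; and multiple rows rely on the shared key prefix so that a scan over that prefix returns all identifications for a specimen.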
