HBase integration - DAO vs more "loosely defined" data access

tim robertson Sun, 28 Jun 2009 07:43:54 -0700

Hi all,

I am curious how people are structuring their data access code when
using HBase, so I was hoping for insights from the community.
I am one of those developers with lots of experience with relational
DBs, Spring, Hibernate etc., and am now exploring HBase after hitting
limits in MySQL (2 tables, each with 200 million rows).
This is *not* an RDBMS vs HBase question, but more related to how to
cleanly structure application code once HBase has been decided upon.


So far, I have done 2 small sample projects each differently:

- In the first one, I more or less copied the Spring JDBC DAO approach
and created a POJO factory per column family, applied to each
RowResult when scanning.  So basically I abstracted a CRUD interface
and search methods that handled POJOs.  I then wired the DAO into the
application with Spring (I guess it just felt normal to do that at
the time ;)

- In the second one, I had reasonably well-defined terms in the
application (e.g. dwc:scientificName), and I built a layer that used
various properties files to map from those terms to tables, families
and columns.  E.g. an insertHarvestedRecord(Map<String, String> data)
method might pick up a properties file mapping dwc:scientificName to
the "unparsed" family, while another method might map it to a
different column family altogether.  Additionally, I was loading in
lots of CSV data, and was able to do a CSV column to HBase
family:column kind of mapping, which worked nicely (although I would
now run the load through MapReduce to distribute it)

The first approach I found quite limiting, as changes meant a lot of
tedious coding and recompilation, but it did catch errors early.  One
of the motivations for this approach was that I could also get other
developers to work on top of it with no knowledge of the data store
(possibly that was a con rather than a pro anyway, as I expect a lot
of MapReduce operations on the data).
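
For illustration, the first approach could be sketched roughly as
below.  This is a minimal sketch under assumptions, not the code from
the sample project: a plain Map of "family:column" names to bytes
stands in for an HBase RowResult so that it runs without a cluster,
and UnparsedRecord, RowMapper, UnparsedRecordMapper and the
"scientificName" / "author" columns are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical POJO for the "unparsed" column family.
final class UnparsedRecord {
    final String rowKey;
    final String scientificName;
    final String author;

    UnparsedRecord(String rowKey, String scientificName, String author) {
        this.rowKey = rowKey;
        this.scientificName = scientificName;
        this.author = author;
    }
}

// One factory per column family turns the raw cells of a row into a POJO.
interface RowMapper<T> {
    T mapRow(String rowKey, Map<String, byte[]> cells);
}

final class UnparsedRecordMapper implements RowMapper<UnparsedRecord> {
    private static final String FAMILY = "unparsed:";

    @Override
    public UnparsedRecord mapRow(String rowKey, Map<String, byte[]> cells) {
        return new UnparsedRecord(
                rowKey,
                text(cells.get(FAMILY + "scientificName")),
                text(cells.get(FAMILY + "author")));
    }

    // A missing cell becomes null rather than an exception.
    private static String text(byte[] value) {
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }
}
```

Adding a column here means touching both the POJO and its mapper,
which is the tedious-but-checked-by-the-compiler trade-off described
above.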

The second approach I found super flexible, but the effort was in
maintaining test cases to catch changes.  I ended up dealing with a
lot of List<Map<String, String>> situations, and definitely the data
store became more "embedded" in the application code itself.
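
The second approach might look something like the following.  Again
this is only a sketch under assumptions: TermMapping and toColumns are
hypothetical names, the mapping is loaded from a String rather than a
file to keep the example self-contained, and the result is a plain Map
of family:column to value rather than an actual HBase write.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Maps well-defined application terms to HBase "family:column" names
// using a properties file, so that mappings change without recompiling.
final class TermMapping {
    private final Properties mapping = new Properties();

    // Note: in the properties format ':' separates key from value, so a
    // term such as dwc:scientificName must be written dwc\:scientificName.
    TermMapping(String propertiesText) {
        try {
            mapping.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Translates term -> value into family:column -> value.
    Map<String, String> toColumns(Map<String, String> data) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : data.entrySet()) {
            String column = mapping.getProperty(entry.getKey());
            if (column == null) {
                // Unmapped terms only surface at runtime, hence the
                // reliance on test cases mentioned above.
                throw new IllegalArgumentException(
                        "No mapping for term: " + entry.getKey());
            }
            columns.put(column, entry.getValue());
        }
        return columns;
    }
}
```

A different method can simply be handed a different properties file,
e.g. one sending dwc:scientificName to another column family.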

Has anyone got any nice insights to share with those moving from the
typical Spring / Hibernate world?
Do you use the HBase API natively?
Enumerations for column defs?  Hardcoded Strings?
Did the ORM thing
(http://www.nabble.com/Hbase-ORM-any-one-interested--td19739869.html)
take off?
Maybe there is no single best-fit approach anyway...
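
To make the enumeration question above concrete, this is the kind of
thing I have in mind, as opposed to scattering hardcoded Strings
through the code.  It is a sketch only: OccurrenceColumn and its
constants are hypothetical, and only the "unparsed" family name comes
from my sample project.

```java
import java.nio.charset.StandardCharsets;

// Column definitions in one place; a typo becomes a compile error
// instead of a silently empty result.
enum OccurrenceColumn {
    UNPARSED_SCIENTIFIC_NAME("unparsed", "scientificName"),
    UNPARSED_AUTHOR("unparsed", "author");

    private final String family;
    private final String qualifier;

    OccurrenceColumn(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    // The "family:column" form used when addressing a cell.
    String qualifiedName() {
        return family + ":" + qualifier;
    }

    // Byte form, since HBase stores column names as bytes.
    byte[] toBytes() {
        return qualifiedName().getBytes(StandardCharsets.UTF_8);
    }
}
```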

Cheers,

Tim
