Hi all,

I am curious how people are structuring their data access code when using HBase, so I was hoping for insights from the community. I am one of those developers with lots of experience with relational DBs, Spring, Hibernate etc., and I am now exploring HBase after hitting limits in MySQL (2 tables, each with 200 million rows). This is *not* an RDBMS vs HBase question; it is about how to cleanly structure application code once HBase has been decided upon.

So far, I have done 2 small sample projects, each differently:

- In the first one, I more or less copied the Spring JDBC DAO approach and created a POJO factory per column family, applied to each RowResult when scanning. So basically I abstracted a CRUD interface and search methods that handled POJOs. I then did a Spring wiring of the DAO into the application (I guess it just felt normal to do that at that time ;)

- In the second one, I had reasonably well defined terms in the application (e.g. dwc:scientificName), and I built a layer that used various properties files to map from my well defined terms to tables, families and columns. E.g. an insertHarvestedRecord(Map<String, String> data) method might pick up a prop file mapping dwc:scientificName to the "unparsed" family, but another method might map it to a different column family altogether. Additionally, I was loading in lots of CSV data, and was able to do a CSV column to HBase family:column kind of mapping, which worked nicely (although I would now run this through MapReduce to distribute the loading).

The first approach I found quite limiting, as changes meant a lot of tedious coding and recompilation, but it did catch errors early. One of the motivations for this approach was that I could also get other developers to work on top with no knowledge of the data store (possibly that was a con and not a pro anyway, as I expect a lot of MapReduce operations on the data).

The second approach I found super flexible, but the effort went into maintaining test cases to catch changes. I ended up dealing with a lot of List<Map<String, String>> situations, and the data store definitely became more "embedded" in the application code itself.

Has anyone got any nice insights to share with those moving from the typical Spring / Hibernate world? Do you use the HBase API natively? Enumerations for column defs? Hardcoded Strings? Did the ORM thing (http://www.nabble.com/Hbase-ORM-any-one-interested--td19739869.html) take off?

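To make the second approach a bit more concrete, here is a rough sketch of the term-to-column mapping idea in plain Java, with the HBase calls deliberately left out. It is only an illustration: the class name TermMapper, the method toColumns, the "parsed" family and the sample values are all made up for this example, and the real layer also mapped to tables. Note that ':' is a key separator in the properties format, so it has to be escaped in keys such as dwc:scientificName.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/**
 * Hypothetical sketch: maps application terms (e.g. dwc:scientificName)
 * to "family:column" names using a properties mapping.
 */
final class TermMapper {
    private final Properties mapping = new Properties();

    /** Loads mappings from properties-format text, one "term=family:column" per line. */
    TermMapper(String propertiesText) {
        try {
            mapping.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rewrites term -> value into family:column -> value; fails fast on unmapped terms. */
    Map<String, String> toColumns(Map<String, String> data) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : data.entrySet()) {
            String column = mapping.getProperty(entry.getKey());
            if (column == null) {
                throw new IllegalArgumentException("No mapping for term: " + entry.getKey());
            }
            cells.put(column, entry.getValue());
        }
        return cells;
    }

    public static void main(String[] args) {
        // The colon in the key is escaped ("\\:" in Java source) because an
        // unescaped ':' would end the key in the properties format.
        TermMapper harvested = new TermMapper("dwc\\:scientificName=unparsed:scientificName\n");
        // A second mapping sends the same term to a different family
        // ("parsed" is an invented name for this sketch).
        TermMapper interpreted = new TermMapper("dwc\\:scientificName=parsed:scientificName\n");

        Map<String, String> record = new LinkedHashMap<>();
        record.put("dwc:scientificName", "Puma concolor");

        System.out.println(harvested.toColumns(record));
        System.out.println(interpreted.toColumns(record));
    }
}
```

The resulting family:column map is what would then be written to HBase for a row. Failing fast on an unmapped term recovers a little of the early error detection that the DAO approach gave for free, though only at runtime.
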
Maybe there is no single best-fit approach anyway...

Cheers,
Tim
