Hey everyone,

So I restarted the nutchbase effort by adding an abstraction over the hbase api. The idea is to use an intermediate nutch api (which then talks to hbase) instead of communicating with hbase directly. This allows us a) to not be completely tied to hbase, making a move to another db easier in the future, and b) perhaps to immediately support multiple databases with easy data migration between them.

What I have is very very (VERY) early and extremely alpha, but I am quite happy with the overall idea, so I am sharing it for suggestions and reviews.

Again, instead of using hbase directly, nutch will use a nice java bean with getters and setters. Nutch will then figure out what to read from and write into hbase. I decided to use avro because it has a very clean design. Here is a very basic WebTableRow class:

    {"namespace": "org.apache.nutch.storage",
     "protocol": "Web",
     "types": [
       {"name": "WebTableRow", "type": "record",
        "fields": [
          {"name": "rowKey", "type": "string"},
          {"name": "fetchTime", "type": "long"},
          {"name": "title", "type": "string"},
          {"name": "text", "type": "string"},
          {"name": "status", "type": "int"}
        ]
       }
     ]
    }

(ignore "protocol". I haven't yet figured out how to compile schemas without protocols)

I have copied and modified avro's SpecificCompiler to generate a java class. It is mostly the same as avro's SpecificCompiler; however, in the generated class the variables are all private and are accessed through getters and setters. Here is a portion of the file:

    public class WebTableRow extends NutchTableRow<Utf8> implements SpecificRecord {
      @RowKey  // these are used for reflection
      private Utf8 rowKey;
      @RowField
      private long fetchTime;
      @RowField
      private Utf8 title;
      @RowField
      private Utf8 text;
      @RowField
      private int status;

      public Utf8 getRowKey() { .... }
      public void setRowKey(Utf8 value) { .... }
      public long getFetchTime() { .... }
      public void setFetchTime(long value) { .... }
      .....

Note that NutchTableRow extends SpecificRecordBase, so this is a proper avro record. In the future, once hadoop MR supports avro as a serialization format, NutchTableRow-s can easily be output through maps and reduces, which is a nice bonus.

We need to force the usage of setters instead of direct access to variables, because one of the nice things about hbase is that you only update the columns that you changed.
However, to know which fields are updated (and thus map them to hbase columns), we must keep track of what changed. Currently, NutchTableRow keeps a BitSet for all fields, and all setter functions update this BitSet, so we know exactly what changed.

There is also a new interface called NutchSerializer that defines readRow and writeRow methods (it also needs scans, row deletes etc., but that's for later). Currently HbaseSerializer implements NutchSerializer and reads and writes WebTableRow-s. HbaseSerializer currently works via reflection. It should be easy to add code generation to our SpecificCompiler so that we can also output a WebTableRowHbaseSerializer along with WebTableRow instead of using reflection.

What I have currently can read and write primitive types + strings into and from hbase. You can check it out from github.com/dogacan/nutchbase (branch master, package o.a.n.storage). Again, I would like to note that the code is very very alpha and is not in good shape, but it should be a good starting point if you are interested.

Once hbase support is solid, I intend to add support for other databases (bdb, cassandra and sql come to mind). If I got everything right, then moving data from one database to another is an incredibly trivial task. So you can start with, say, bdb, then switch over to hbase once your data gets large.

Oh I forgot... HbaseSerializer reads an hbase-mapping.xml file that describes the mapping between fields and hbase columns:

    <table name="webtable" class="org.apache.nutch.storage.WebTableRow">
      <description>
        <family name="p"/> <!-- This can also have params like compression, bloom filters -->
        <family name="f"/>
      </description>
      <fields>
        <field name="fetchTime" family="f" qualifier="ts"/>
        <field name="title" family="p" qualifier="t"/>
        <field name="text" family="p" qualifier="c"/>
        <field name="status" family="f" qualifier="st"/>
      </fields>
    </table>

Sorry for the long and rambling email.
Feel free to ask if anything is unclear (and I assume it must be, given my incoherent description :)

--
Doğacan Güney
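
For readers who want the dirty-tracking idea from the email in concrete form, here is a minimal Java sketch. It is not the actual nutchbase code. SketchRow, SketchSerializer and changedColumns are made-up names, String stands in for Utf8, and a plain Map stands in for an hbase row. The column names reuse the family:qualifier pairs from the mapping file in the email; everything else is an assumption for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a generated row class; not the real WebTableRow.
class SketchRow {
    // Positions of each field in the dirty BitSet (illustrative numbering).
    static final int FETCH_TIME = 0;
    static final int TITLE = 1;
    static final int STATUS = 2;
    static final int FIELD_COUNT = 3;

    private final BitSet dirty = new BitSet(FIELD_COUNT);
    private long fetchTime;
    private String title;   // the email uses avro's Utf8; String keeps the sketch dependency-free
    private int status;

    public long getFetchTime() { return fetchTime; }
    public String getTitle() { return title; }
    public int getStatus() { return status; }

    // Every setter records the change, which is why direct field access must be avoided.
    public void setFetchTime(long value) { fetchTime = value; dirty.set(FETCH_TIME); }
    public void setTitle(String value) { title = value; dirty.set(TITLE); }
    public void setStatus(int value) { status = value; dirty.set(STATUS); }

    public boolean isDirty(int field) { return dirty.get(field); }
    public void clearDirty() { dirty.clear(); }

    // Generic access by field position, boxed for simplicity.
    Object get(int field) {
        switch (field) {
            case FETCH_TIME: return fetchTime;
            case TITLE: return title;
            case STATUS: return status;
            default: throw new IllegalArgumentException("unknown field " + field);
        }
    }
}

// Hypothetical serializer that only collects columns for changed fields.
class SketchSerializer {
    // family:qualifier per field position, taken from the mapping file in the email.
    private static final String[] COLUMNS = {"f:ts", "p:t", "f:st"};

    // Returns column -> value for fields whose setter was called since the last clearDirty().
    static Map<String, Object> changedColumns(SketchRow row) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            if (row.isDirty(i)) {
                out.put(COLUMNS[i], row.get(i));
            }
        }
        return out;
    }
}

class DirtyTrackingDemo {
    public static void main(String[] args) {
        SketchRow row = new SketchRow();
        row.setTitle("Apache Nutch");
        row.setStatus(2);
        // fetchTime was never set, so its column is not written.
        System.out.println(SketchSerializer.changedColumns(row)); // prints {p:t=Apache Nutch, f:st=2}
    }
}
```

A real serializer would turn the returned map into a single hbase write and then call clearDirty(), so the next write only carries what changed after that point.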