So all components, such as Injector, Generator, Fetcher, and Indexer, will read the table name from this mapping file? The commands will then be different from the current version.
2009/12/8 Doğacan Güney <doga...@gmail.com>

> Hey everyone,
>
> So I restarted nutchbase efforts with adding an abstraction to the hbase
> api. The idea is to use an intermediate nutch api (which then talks with
> hbase) instead of communicating with hbase directly. This allows us a) to
> not be completely tied down to hbase, making a move to another db in the
> future easier b) perhaps to immediately support multiple databases with
> easy data migration between them.
>
> What I have is very very (VERY) early and extremely alpha but I am quite
> happy with the overall idea so I am sharing it for suggestions and reviews.
> Again, instead of using hbase directly, nutch will use a nice java bean
> with getters and setters. Nutch will then figure out what to read/write
> into hbase.
>
> I decided to use avro because it has a very clean design. Here is a very
> basic WebTableRow class:
>
> {"namespace": "org.apache.nutch.storage",
>  "protocol": "Web",
>
>  "types": [
>      {"name": "WebTableRow", "type": "record",
>       "fields": [
>           {"name": "rowKey", "type": "string"},
>           {"name": "fetchTime", "type": "long"},
>           {"name": "title", "type": "string"},
>           {"name": "text", "type": "string"},
>           {"name": "status", "type": "int"}
>       ]
>      }
>  ]
> }
>
> (ignore "protocol". I haven't yet figured out how to compile schemas
> without protocols)
>
> I have copied and modified avro's SpecificCompiler to generate a java
> class. It is mostly the same class as avro's SpecificCompiler; however,
> the variables are all private and are accessed through getters and
> setters. Here is a portion of the file:
>
> public class WebTableRow extends NutchTableRow<Utf8> implements
> SpecificRecord {
>   @RowKey // these are used for reflection
>   private Utf8 rowKey;
>   @RowField
>   private long fetchTime;
>   @RowField
>   private Utf8 title;
>   @RowField
>   private Utf8 text;
>   @RowField
>   private int status;
>   public Utf8 getRowKey() { .... }
>   public void setRowKey(Utf8 value) { .... }
>   public long getFetchTime() { .... }
>   public void setFetchTime(long value) { .... }
>   .....
>
> Note that NutchTableRow extends SpecificRecordBase so this is a proper
> avro record. In the future, once hadoop MR supports avro as a
> serialization format, NutchTableRow-s can easily be output through maps
> and reduces, which is a nice bonus.
>
> We need to force the usage of setters instead of direct access to
> variables, because one of the nice things about hbase is that you only
> update the columns that you changed. However, to know which fields are
> updated (and thus, map them to hbase columns), we must keep track of what
> changed. Currently, NutchTableRow keeps a BitSet for all fields and all
> setter functions update this BitSet so we know exactly what changed.
>
> There is also a new interface called NutchSerializer that defines readRow
> and writeRow methods (it also needs scans, delete rows etc., but that's
> for later). Currently HbaseSerializer implements NutchSerializer and reads
> and writes WebTableRow-s. HbaseSerializer currently works via reflection.
> It should be easy to add code generation to our SpecificCompiler so that
> we can also output a WebTableRowHbaseSerializer along with WebTableRow
> instead of using reflection.
>
> What I have currently can read and write primitive types + strings into
> and from hbase. You can check it out from github.com/dogacan/nutchbase
> (branch master, package o.a.n.storage). Again, I would like to note that
> the code is very very alpha and is not in good shape but it should be a
> good starting point if you are interested.
>
> Once hbase support is solid, I intend to add support for other databases
> (bdb, cassandra and sql come to mind). If I got everything right, then
> moving data from one database to another is an incredibly trivial task.
> So, you can start with, say, bdb then switch over to hbase once your data
> gets large.
>
> Oh I forgot... HbaseSerializer reads an hbase-mapping.xml file that
> describes the mapping between fields and hbase columns:
>
> <table name="webtable" class="org.apache.nutch.storage.WebTableRow">
>   <description>
>     <family name="p"/> <!-- This can also have params like compression,
>                             bloom filters -->
>     <family name="f"/>
>   </description>
>   <fields>
>     <field name="fetchTime" family="f" qualifier="ts"/>
>     <field name="title" family="p" qualifier="t"/>
>     <field name="text" family="p" qualifier="c"/>
>     <field name="status" family="f" qualifier="st"/>
>   </fields>
> </table>
>
> Sorry for the long and rambling email. Feel free to ask if anything is
> unclear (and I assume it must be, given my incoherent description :)
>
> --
> Doğacan Güney
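
To check my understanding of the setter/BitSet idea and the field-to-column mapping described in the quoted mail, here is a minimal sketch in plain Java. It is not the nutchbase code: DirtyTrackingSketch, PageRow, mapping() and changedColumns() are hypothetical names, String stands in for Utf8, and the mapping is hard-coded with the values from the hbase-mapping.xml example instead of being parsed from the file. No hbase or avro classes are used.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class DirtyTrackingSketch {

    /** Hypothetical stand-in for WebTableRow; only setters may change fields. */
    public static class PageRow {
        // Bit positions for each field (assumed ordering, for illustration only).
        public static final int FETCH_TIME = 0, TITLE = 1, TEXT = 2, STATUS = 3;
        public static final String[] FIELD_NAMES = {"fetchTime", "title", "text", "status"};

        private final String rowKey;
        private long fetchTime;
        private String title;
        private String text;
        private int status;
        // One bit per field; a set bit means "changed since the last write".
        private final BitSet dirty = new BitSet(FIELD_NAMES.length);

        public PageRow(String rowKey) { this.rowKey = rowKey; }

        public String getRowKey() { return rowKey; }
        public void setFetchTime(long v) { fetchTime = v; dirty.set(FETCH_TIME); }
        public void setTitle(String v)   { title = v;     dirty.set(TITLE); }
        public void setText(String v)    { text = v;      dirty.set(TEXT); }
        public void setStatus(int v)     { status = v;    dirty.set(STATUS); }

        /** Returns the current value of the field at the given bit position. */
        public Object get(int index) {
            switch (index) {
                case FETCH_TIME: return fetchTime;
                case TITLE:      return title;
                case TEXT:       return text;
                case STATUS:     return status;
                default: throw new IllegalArgumentException("unknown field " + index);
            }
        }

        /** Returns a copy of the dirty bits so callers cannot modify them. */
        public BitSet dirtyBits() { return (BitSet) dirty.clone(); }

        /** Would be called after a successful write. */
        public void clearDirty() { dirty.clear(); }
    }

    /** Field name to "family:qualifier", copied from the mapping example. */
    public static Map<String, String> mapping() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("fetchTime", "f:ts");
        m.put("title", "p:t");
        m.put("text", "p:c");
        m.put("status", "f:st");
        return m;
    }

    /** Collects only the columns whose fields were changed through a setter. */
    public static Map<String, Object> changedColumns(PageRow row, Map<String, String> mapping) {
        Map<String, Object> out = new LinkedHashMap<>();
        BitSet dirty = row.dirtyBits();
        for (int i = dirty.nextSetBit(0); i >= 0; i = dirty.nextSetBit(i + 1)) {
            out.put(mapping.get(PageRow.FIELD_NAMES[i]), row.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        PageRow row = new PageRow("http://example.com/");
        row.setTitle("Example");
        row.setStatus(2);
        // Only the two touched fields show up as columns to write.
        System.out.println(changedColumns(row, mapping())); // prints {p:t=Example, f:st=2}
    }
}
```

If this matches the intent, a serializer would only need the row key, the dirty bits and the mapping to build a partial update, and the table name could come from the same mapping file.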