Re: State of nutchbase

xiao yang Sat, 23 Jan 2010 01:50:29 -0800

So will all components, such as Injector, Generator, Fetcher and Indexer, read
the table name from this mapping file?
The commands will be different from the current version.


2009/12/8 Doğacan Güney <doga...@gmail.com>

> Hey everyone,
>
> So I restarted nutchbase efforts by adding an abstraction over the hbase
> api. The idea is to use an intermediate nutch api (which then talks with
> hbase) instead of communicating with hbase directly. This allows us a) to
> not be completely tied down to hbase, making a move to another db in the
> future easier b) perhaps to immediately support multiple databases with easy
> data migration between them.
>
> What I have is very very (VERY) early and extremely alpha but I am quite
> happy with the overall idea so I am sharing it for suggestions and reviews.
> Again, instead of using hbase directly, nutch will use a nice java bean with
> getters and setters. Nutch will then figure out what to read/write into
> hbase.
>
> I decided to use avro because it has a very clean design. Here is a very
> basic WebTableRow class:
> {"namespace": "org.apache.nutch.storage",
>  "protocol": "Web",
>
>  "types": [
>      {"name": "WebTableRow", "type": "record",
>       "fields": [
>           {"name": "rowKey", "type": "string"},
>           {"name": "fetchTime", "type": "long"},
>           {"name": "title", "type": "string"},
>           {"name": "text", "type": "string"},
>           {"name": "status", "type": "int"}
>       ]
>      }
>  ]
> }
>
> (ignore "protocol". I haven't yet figured out how to compile schemas
> without protocols)
>
> I have copied and modified avro's SpecificCompiler to generate a java
> class. It is mostly the same class as avro's SpecificCompiler; however, the
> variables are all private and are accessed through getters and setters. Here
> is a portion of the file:
>
> public class WebTableRow extends NutchTableRow<Utf8>
>     implements SpecificRecord {
>   @RowKey // these are used for reflection
>   private Utf8 rowKey;
>   @RowField
>   private long fetchTime;
>   @RowField
>   private Utf8 title;
>   @RowField
>   private Utf8 text;
>   @RowField
>   private int status;
>   public Utf8 getRowKey() { .... }
>   public void setRowKey(Utf8 value) {....}
>   public long getFetchTime() { .... }
>   public void setFetchTime(long value) { .... }
>   .....
>
> Note that NutchTableRow extends SpecificRecordBase so this is a proper avro
> record. In the future, once hadoop MR supports avro as a serialization
> format, NutchTableRow-s can easily be output through maps and reduces, which
> is a nice bonus.
>
> We need to force the usage of setters instead of direct access to
> variables, because one of the nice things about hbase is that you only
> update the columns that you changed. However, to know which fields are
> updated (and thus map them to hbase columns), we must keep track of what
> changed. Currently, NutchTableRow keeps a BitSet for all fields and all
> setter functions update this BitSet so we know exactly what changed.
>
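The setter-based change tracking described above could look roughly like the following. This is only a minimal sketch: DirtyTrackingRow, ExampleRow, markDirty, isDirty and clearDirty are hypothetical names chosen for illustration, not the actual NutchTableRow code.

```java
import java.util.BitSet;

// Hypothetical stand-in for NutchTableRow: one bit per field, set by setters.
abstract class DirtyTrackingRow {
    private final BitSet dirty;

    protected DirtyTrackingRow(int fieldCount) {
        dirty = new BitSet(fieldCount);
    }

    // Called by every generated setter.
    protected void markDirty(int fieldIndex) {
        dirty.set(fieldIndex);
    }

    // A serializer would ask this to decide which columns to write.
    public boolean isDirty(int fieldIndex) {
        return dirty.get(fieldIndex);
    }

    // Called after a successful write so the next write starts clean.
    public void clearDirty() {
        dirty.clear();
    }
}

// Hypothetical row with two of the fields from the schema above.
class ExampleRow extends DirtyTrackingRow {
    static final int FETCH_TIME = 0;
    static final int STATUS = 1;

    private long fetchTime;
    private int status;

    ExampleRow() {
        super(2);
    }

    public long getFetchTime() { return fetchTime; }
    public void setFetchTime(long value) { fetchTime = value; markDirty(FETCH_TIME); }

    public int getStatus() { return status; }
    public void setStatus(int value) { status = value; markDirty(STATUS); }
}

class DirtyTrackingDemo {
    public static void main(String[] args) {
        ExampleRow row = new ExampleRow();
        row.setFetchTime(1264240229000L);
        // Only fetchTime was set, so only that column would need an update.
        System.out.println("fetchTime dirty: " + row.isDirty(ExampleRow.FETCH_TIME));
        System.out.println("status dirty: " + row.isDirty(ExampleRow.STATUS));
    }
}
```

Because the fields are private, callers cannot change a value without going through a setter, which is what keeps the BitSet accurate.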
> There is also a new interface called NutchSerializer that defines readRow
> and writeRow methods (it also needs scans, row deletes, etc., but that's for
> later). Currently HbaseSerializer implements NutchSerializer and reads and
> writes WebTableRow-s. HbaseSerializer currently works via reflection. It
> should be easy to add code generation to our SpecificCompiler so that we can
> also output a WebTableRowHbaseSerializer along with WebTableRow instead of
> using reflection.
>
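To make the serializer abstraction concrete, here is a small sketch of an interface with readRow and writeRow plus a toy in-memory backend. The signatures are assumptions (the message only names the two methods), and SketchSerializer, InMemorySerializer and copyRow are hypothetical names, not part of nutchbase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape for the interface; the real NutchSerializer may differ.
interface SketchSerializer<K, R> {
    R readRow(K key);            // returns null when the row does not exist
    void writeRow(K key, R row);
}

// Toy backend that keeps rows in a HashMap, standing in for a real database.
class InMemorySerializer<K, R> implements SketchSerializer<K, R> {
    private final Map<K, R> rows = new HashMap<>();

    @Override
    public R readRow(K key) {
        return rows.get(key);
    }

    @Override
    public void writeRow(K key, R row) {
        rows.put(key, row);
    }
}

class SerializerDemo {
    // Copies one row between two backends using only the interface.
    static <K, R> void copyRow(K key, SketchSerializer<K, R> from, SketchSerializer<K, R> to) {
        R row = from.readRow(key);
        if (row != null) {
            to.writeRow(key, row);
        }
    }

    public static void main(String[] args) {
        InMemorySerializer<String, String> source = new InMemorySerializer<>();
        InMemorySerializer<String, String> target = new InMemorySerializer<>();
        source.writeRow("http://example.org/", "example title");
        copyRow("http://example.org/", source, target);
        System.out.println(target.readRow("http://example.org/"));
    }
}
```

Code written against the interface never sees which backend is in use, so moving a row from one store to another reduces to a read followed by a write.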
> What I have currently can read and write primitive types + strings into and
> from hbase. You can check it out from github.com/dogacan/nutchbase (branch
> master, package o.a.n.storage). Again, I would like to note that the code is
> very very alpha and is not in good shape but it should be a good starting
> point if you are interested.
>
> Once hbase support is solid, I intend to add support for other databases
> (bdb, cassandra and sql come to mind). If I got everything right, then
> moving data from one database to another is an incredibly trivial task. So,
> you can start with, say, bdb then switch over to hbase once your data gets
> large.
>
> Oh I forgot... HbaseSerializer reads an hbase-mapping.xml file that
> describes the mapping between fields and hbase columns:
>
> <table name="webtable" class="org.apache.nutch.storage.WebTableRow">
>   <description>
>     <!-- A family can also have params like compression, bloom filters -->
>     <family name="p"/>
>     <family name="f"/>
>   </description>
>   <fields>
>     <field name="fetchTime" family="f" qualifier="ts"/>
>     <field name="title" family="p" qualifier="t"/>
>     <field name="text" family="p" qualifier="c"/>
>     <field name="status" family="f" qualifier="st"/>
>   </fields>
> </table>
>
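As an illustration of how such a mapping file could be consumed, the sketch below reads the field elements with the JDK's DOM parser and builds a field name to column lookup. MappingReader and readFieldColumns are hypothetical names, and joining family and qualifier as "family:qualifier" is just a display choice here; this is not how HbaseSerializer is actually implemented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class MappingReader {
    // Returns field name -> "family:qualifier" for every <field> element.
    static Map<String, String> readFieldColumns(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("field");
            Map<String, String> columns = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                columns.put(field.getAttribute("name"),
                        field.getAttribute("family") + ":" + field.getAttribute("qualifier"));
            }
            return columns;
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here.
            throw new IllegalArgumentException("Could not parse mapping XML", e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<table name='webtable' class='org.apache.nutch.storage.WebTableRow'>"
              + "<fields>"
              + "<field name='fetchTime' family='f' qualifier='ts'/>"
              + "<field name='title' family='p' qualifier='t'/>"
              + "</fields>"
              + "</table>";
        System.out.println(readFieldColumns(xml));
    }
}
```

The table name could be read from the same document through the root element's name attribute.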
> Sorry for the long and rambling email. Feel free to ask if anything is
> unclear (and I assume it must be, given my incoherent description :)
> --
> Doğacan Güney
>
>
