On Mon, Oct 19, 2009 at 7:40 PM, yz5od2 <[email protected]> wrote:
> ok, so what you are saying is that my mapper should talk directly to HBase
> to write the data into it? Or I should define my Mapper implementation class
> like
>
> Mapper<LongWritable,Text,Text,byte[]>
Your Mapper must output a Hadoop Writable. You have two options:
1. Handle HBase entirely yourself, using Hadoop only as a way to distribute
your load and data across the cluster. In that case you can declare
NullWritable output types and never call output.collect (0.19 API) or
context.write (0.20 API) at all.
2. Output HBase Puts and Deletes from the Mapper and use TableOutputFormat.
Put and Delete both implement Writable but don't share a more specific
superclass, so the Mapper's signature becomes the somewhat confusing
<K1, V1, K2, Writable>, where K1 and V1 are whatever your input requires and
K2 is completely ignored (see the skeleton just below).
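To make that signature concrete, a skeleton might look like this (the class
name and the LongWritable/Text input side are just assumptions for a plain
text-file input):

import org.apache.hadoop.io.*;              // LongWritable, Text, NullWritable, Writable
import org.apache.hadoop.mapreduce.Mapper;

public class PojoImportMapper
    extends Mapper<LongWritable, Text, NullWritable, Writable> {
  // map() builds Put (or Delete) objects and hands them to context.write()
}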
The second option involves writing less code. You would do something like
this:
byte[] rowId = ...;                     // row key for the new row
byte[] content = pojo.serialize();      // your own serialization scheme
Put put = new Put(rowId);
// family "content", qualifier "thrift-thingie", value = the serialized POJO
put.add(Bytes.toBytes("content"), Bytes.toBytes("thrift-thingie"), content);
context.write(NullWritable.get(), put); // the key is ignored by TableOutputFormat
As Ryan says, you don't want to use Hadoop writables as your serialization
scheme, but they are part of the API to pass data to an output format.
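For the Puts to actually reach the table, the driver also has to point the job
at TableOutputFormat. Something roughly like this (I'm going from memory on
the 0.20 names; "mytable", the input path argument, and the PojoImportMapper
class from the skeleton above are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PojoImportJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "pojo import");
    job.setJarByClass(PojoImportMapper.class);
    job.setMapperClass(PojoImportMapper.class);
    job.setNumReduceTasks(0);                 // map-only: each Put goes straight to the output format
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "mytable");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.waitForCompletion(true);
  }
}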
I don't know if the first option has any advantages. Probably flexibility,
and better control over details like when to flush the commits.
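For completeness, the first option would look roughly like the sketch below
(table name, column family/qualifier, and row key choice are all
placeholders); the point is just that the mapper owns the HTable and decides
when flushCommits() happens:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class DirectHBaseMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(new HBaseConfiguration(context.getConfiguration()), "mytable");
    table.setAutoFlush(false);        // buffer puts client-side; we decide when to flush
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException {
    Put put = new Put(Bytes.toBytes(key.get()));   // placeholder row key
    put.add(Bytes.toBytes("content"), Bytes.toBytes("thrift-thingie"),
        Bytes.toBytes(value.toString()));
    table.put(put);                   // straight to HBase; no context.write() at all
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();             // the flush control mentioned above
    table.close();
  }
}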