I have to recommend doing the puts via the API straight in the mapper.
Passing all your data through the shuffle is not necessary, since
inserting into HBase is itself a form of sorting. Besides, let's not
copy a 100 GB import more times than we have to, right?
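
A minimal sketch of that approach, assuming the 0.20 mapreduce API and a
plain text input; the table name, column family and row-key derivation
below are placeholders, not anything prescribed by HBase:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only import: each mapper writes its Puts straight to HBase, so
// nothing goes through sort/shuffle.  Output types are NullWritable
// because context.write() is never called.
public class DirectPutMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(new HBaseConfiguration(context.getConfiguration()),
        "my_table");                    // placeholder table name
    table.setAutoFlush(false);          // buffer puts client-side
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException {
    byte[] rowId = Bytes.toBytes(line.toString());   // placeholder row key
    Put put = new Put(rowId);
    put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"),
        Bytes.toBytes(line.toString()));
    table.put(put);                     // straight to the table, no collect()
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();               // push anything still buffered
  }
}

Run it as a map-only job (job.setNumReduceTasks(0)) so there is no
reduce phase at all; the setAutoFlush/flushCommits pair is where you get
the control over when commits are flushed.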

On Mon, Oct 19, 2009 at 11:41 PM, Kevin Peterson <[email protected]> wrote:
> On Mon, Oct 19, 2009 at 7:40 PM, yz5od2 <[email protected]> wrote:
>
>> ok, so what you are saying is that my mapper should talk directly to HBase
>> to write the data into it? Or should I define my Mapper implementation
>> class like
>>
>> Mapper<LongWritable,Text,Text,byte[]>
>
>
> Your Mapper must output a Hadoop Writable. You have two options:
>
> 1. Handle HBase all yourself, and you are just using Hadoop as a way to
> distribute your load and data across your cluster. Then you can just use
> NullWritables and not call output.collect (0.19 API) or context.write (0.20
> API) at all.
> 2. Output HBase Puts and Deletes from the Mapper and use TableOutputFormat.
> Put and Delete both implement Writable, but don't share a more specific superclass,
> so the signature for the Mapper is the somewhat confusing <K1, V1, K2,
> Writable>, where K1 and V1 are whatever is needed for your input, and K2 is
> completely ignored.
>
> The second one would involve writing less code. You would do something like
> this:
>
> byte[] rowId = ...;
> byte[] content = pojo.serialize();
> Put put = new Put(rowId);
> put.add(Bytes.toBytes("content"), Bytes.toBytes("thrift-thingie"), content);
> context.write(NullWritable.get(), put);
>
> As Ryan says, you don't want to use Hadoop writables as your serialization
> scheme, but they are part of the API to pass data to an output format.
>
> I don't know if the first has any advantages. Probably flexibility, and
> better control over details like when to flush the commits.
>
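
For what it's worth, the wiring for the second option is mostly job
configuration. A rough, self-contained sketch against the 0.20 API; the
table name, column family, input path and row-key derivation are all
placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ImportJob {

  // A Put-emitting mapper along the lines of the quoted snippet; the
  // key passed to context.write() is ignored by TableOutputFormat.
  public static class PutMapper
      extends Mapper<LongWritable, Text, NullWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(Bytes.toBytes(line.toString()));  // placeholder row key
      put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"),
          Bytes.toBytes(line.toString()));
      context.write(NullWritable.get(), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    // Tell TableOutputFormat which table the Puts should land in.
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table");

    Job job = new Job(conf, "hbase import");
    job.setJarByClass(ImportJob.class);
    job.setMapperClass(PutMapper.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/import/input"));

    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);  // ignored by the format
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);                   // map-only, nothing is shuffled
    // Some 0.20.x builds of TableOutputFormat still insist on an output
    // directory even though nothing is written there; setting one is
    // harmless either way.
    FileOutputFormat.setOutputPath(job, new Path("/tmp/hbase-import-out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With zero reduces the Puts go from the mapper straight into
TableOutputFormat, so this path skips the shuffle copy too; the flush
control Kevin mentions is what you give up, since the output format
manages the HTable and its commits for you.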
