I have to recommend doing the puts via the API straight in the mapper. Passing all your data through the shuffle is not necessary, since inserting into HBase is itself a form of sorting. Besides, let's not copy a 100 GB import more times than we have to, right?
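Very roughly, a mapper doing its own puts could look something like the sketch below. This is just a sketch against the 0.20 mapreduce API and a reasonably current HBase client (constructor names differ a bit between versions); the table name "mytable", the "content" family, the "raw" qualifier, and the row-key choice are all placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DirectPutMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // One HTable per map task, reused for every record the task sees.
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    table = new HTable(conf, "mytable");
    table.setAutoFlush(false);   // buffer puts client-side, flush in batches
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException {
    Put put = new Put(Bytes.toBytes(key.get()));   // placeholder row key
    put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"),
        Bytes.toBytes(value.toString()));
    table.put(put);
    // No context.write() here -- nothing goes through the shuffle.
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();
    table.close();
  }
}

Run it as a map-only job (job.setNumReduceTasks(0)) and the data never leaves the map side.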
On Mon, Oct 19, 2009 at 11:41 PM, Kevin Peterson <[email protected]> wrote:

> On Mon, Oct 19, 2009 at 7:40 PM, yz5od2 <[email protected]> wrote:
>
>> ok, so what you are saying is that my mapper should talk directly to HBase
>> to write the data into it? Or I should define my Mapper implementation
>> class like
>>
>> Mapper<LongWritable,Text,Text,byte[]>
>
> Your Mapper must output a Hadoop Writable. You have two options:
>
> 1. Handle HBase all yourself, and you are just using Hadoop as a way to
> distribute your load and data across your cluster. Then you can just use
> NullWritables and not call output.collect (0.19 API) or context.write (0.20
> API) at all.
> 2. Output HBase Puts and Deletes from the Mapper and use TableOutputFormat.
> Put and Delete extend Writable, but don't share a more specific superclass,
> so the signature for the Mapper is the somewhat confusing <K1, V1, K2,
> Writable>, where K1 and V1 are whatever is needed for your input, and K2 is
> completely ignored.
>
> The second one would involve writing less code. You would do something like
> this:
>
> byte[] rowId = ...;
> byte[] content = pojo.serialize();
> Put put = new Put(rowId);
> put.add(Bytes.toBytes("content"), Bytes.toBytes("thrift-thingie"), content);
> context.write(NullWritable.get(), put);
>
> As Ryan says, you don't want to use Hadoop writables as your serialization
> scheme, but they are part of the API to pass data to an output format.
>
> I don't know if the first has any advantages. Probably flexibility, and
> better control over details like when to flush the commits.
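For what it's worth, the job wiring for the TableOutputFormat route (option 2 above) is mostly one helper call. A rough sketch, again assuming the 0.20 mapreduce API; "mytable", the "content" family, and the row key are placeholders, and the mapper just mirrors the quoted snippet:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PutLoadJob {

  // TableOutputFormat ignores the output key, so NullWritable is fine here.
  public static class PutMapper
      extends Mapper<LongWritable, Text, NullWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(Bytes.toBytes(key.get()));   // placeholder row key
      put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"),
          Bytes.toBytes(value.toString()));
      context.write(NullWritable.get(), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "load-into-hbase");
    job.setJarByClass(PutLoadJob.class);
    job.setMapperClass(PutMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Sets TableOutputFormat and the target table; null = no reducer class.
    TableMapReduceUtil.initTableReducerJob("mytable", null, job);
    job.setNumReduceTasks(0);   // map-only: puts go straight to the table
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Either way the shuffle is skipped; the difference is just whether you manage the HTable yourself or let TableOutputFormat do the buffering and flushing for you.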
