Hi Kiyan,

On Tue, Oct 2, 2012 at 11:32 PM, Kiyan Ahmadizadeh <[email protected]> wrote:
> HBase allows clients to load data into HBase by generating HFiles in a
> MapReduce job and then loading those HFiles into HBase by running the
> CompleteBulkLoad tool. We'd like to enable this behavior in Crunch.
>
> Getting Crunch to generate HFiles as the result of the job is as simple as
> configuring the correct output format. The question of where/when to
> invoke the CompleteBulkLoad tool on those generated files is a little
> trickier. I originally posed this question to just Josh, but on his
> suggestion I thought I'd open it up to the whole group. Josh's original
> response is below and suggests adding a callback mechanism to Target.
> This sounds like a good idea to me. Does anyone else have thoughts or
> ideas on the issue?
It's been quite a while since I worked with bulk imports in HBase, but from
what I remember (and from a look at the current HBase trunk), I don't think
it's necessarily as simple as writing to HFileOutputFormat to do a bulk load.
I think (and please correct me if I'm wrong) that there are additional
requirements, at least when loading into an existing table: all Puts (or
KeyValues) must be sorted in total order, and all KeyValues must be
partitioned according to the table's existing regions, with the partitioning
consistent across all column families.

These components can probably be largely plugged into a pipeline, but it's
more complex than just setting the output format to HFileOutputFormat.
Seeing as extra functionality (i.e. dedicated pipeline code) is needed to
facilitate this anyway, I'm wondering if adding callback hooks to Target is
worth it -- it might be easier to just add a call to pipeline.run() and then
run the CompleteBulkLoad tool from the dedicated pipeline code that sets up
the sorted and partitioned HFiles.

- Gabriel
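For anyone following along, here is a rough sketch of the plain-MapReduce
bulk-load setup being discussed, i.e. what Crunch would need to replicate.
This is a job-configuration sketch against the HBase 0.94-era API, not
Crunch code; the table name and output path are made-up placeholders.
HFileOutputFormat.configureIncrementalLoad is the piece that wires up the
total-order partitioning and sorting requirements mentioned above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hfile-generation");
    // ... set the mapper, input format, etc. for the job here ...

    // Hypothetical table name for illustration.
    HTable table = new HTable(conf, "my_table");

    // configureIncrementalLoad installs a TotalOrderPartitioner seeded
    // with the table's current region boundaries, a KeyValue-sorting
    // reducer, and HFileOutputFormat itself -- covering the sorting and
    // partitioning requirements for an existing table.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    // Hypothetical output directory for the generated HFiles.
    Path outputDir = new Path("/tmp/hfiles");
    FileOutputFormat.setOutputPath(job, outputDir);

    if (job.waitForCompletion(true)) {
      // Programmatic equivalent of running the CompleteBulkLoad tool:
      // move the generated HFiles into the table's regions.
      new LoadIncrementalHFiles(conf).doBulkLoad(outputDir, table);
    }
  }
}
```

In pipeline terms, everything up to waitForCompletion corresponds to what a
Crunch Target could configure, and the doBulkLoad call at the end is the
step the thread is debating where to hang -- a Target callback versus
dedicated pipeline code after pipeline.run().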
