Hbase + mapreduce -- operational design question

Dhodapkar, Chinmay Fri, 09 Sep 2011 14:55:16 -0700

Hello,
I have a setup where a bunch of clients store 'events' in an HBase table.
Also, periodically (once a day), I run a MapReduce job that goes over the table
and computes some reports.


Now my issue is that on the next run I don't want the MapReduce job to process
the 'events' that it has already processed. I know that I can mark processed
events in the HBase table and have the mapper filter them out during the next
run. But what I would really like is for previously processed events to not
even hit the mapper.
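
To make the marking approach concrete, here is a rough sketch of what I mean.
It is plain Python with an in-memory dict standing in for the HBase table, and
the "processed" column name is made up for illustration. It only shows the
behaviour I want to avoid: every row is still read and handed to the mapper,
even when nothing is left to do.

```python
# Hypothetical stand-in for the HBase table: row key -> column map.
# The "processed" column name is an assumption for illustration only.
events = {
    "evt-001": {"payload": "login", "processed": True},
    "evt-002": {"payload": "click", "processed": False},
    "evt-003": {"payload": "logout"},  # never marked, so still pending
}


def run_job(table):
    """Simulate one daily run: every row reaches the mapper, which skips marked rows."""
    rows_seen = 0
    handled = []
    for row_key in sorted(table):
        columns = table[row_key]
        rows_seen += 1  # the mapper is still invoked for this row
        if columns.get("processed"):
            continue  # filtered inside the mapper, after the row was read
        handled.append(row_key)
        columns["processed"] = True  # mark so the next run skips it
    return rows_seen, handled


first = run_job(events)
second = run_job(events)
print(first)
print(second)
```

The second run does no useful work, yet it still sees all three rows. That
per-row cost is what I would like to get rid of.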

One solution I can think of is to back up the HBase table after running the job
and then clear the table. But this has a lot of problems:
1) Clients may have inserted events while the job was running.
2) I could disable and drop the table and then create it again, but then the
clients would complain about the short window of unavailability.


What do people using HBase (live) + MapReduce typically do?

Thanks!
Chinmay
