I believe the current TableOutputFormat supports both Puts and Deletes, so this functionality is already available in an MR context.
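
For example, a map-only job along these lines should do it, scanning the
obsolete range and emitting a Delete per row. This is a rough sketch
against the 0.20.x client API; the table name "mytable", the scan
settings, and the class names are placeholders, and you should verify
that your version of TableOutputFormat actually accepts Deletes before
relying on it:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

/** Map-only job: scan the obsolete range, emit a Delete for each row. */
public class MassDeleteJob {

  static class DeleteMapper
      extends TableMapper<ImmutableBytesWritable, Delete> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result,
        Context context) throws IOException, InterruptedException {
      // Deletes the whole row; narrow this to specific families or
      // qualifiers if you only want to drop part of it.
      context.write(row, new Delete(row.get()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "mass-delete");
    job.setJarByClass(MassDeleteJob.class);

    // Restrict the scan to the rows you want gone, e.g. via start/stop
    // row or a timestamp range.
    Scan scan = new Scan();
    scan.setCaching(500);

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        DeleteMapper.class, ImmutableBytesWritable.class, Delete.class,
        job);
    // Passing a null reducer class just wires up TableOutputFormat so
    // the emitted Deletes are applied back to the table.
    TableMapReduceUtil.initTableReducerJob("mytable", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each map task then handles one region's worth of rows, so the delete
load is spread across the cluster instead of funneling through a single
client.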
> -----Original Message-----
> From: Juhani Connolly [mailto:juh...@ninja.co.jp]
> Sent: Sunday, April 04, 2010 11:14 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Efficient mass deletes
>
> Currently it is just something I expect to run into problems with, as
> I am still some way off from load testing, though I hope to get
> started on it soon. The MultiDelete implementation planned for 0.21
> will certainly help a lot, though.
> Perhaps running an M/R job with a scan result as the input, where each
> task deletes a range, could be an efficient way to do these kinds of
> mass deletes?
>
> On 04/03/2010 01:26 AM, Jonathan Gray wrote:
> > Juhani,
> >
> > Deletes are really special versions of Puts (so they are equally
> > fast). I suppose it would be possible to have some kind of special
> > filter that issued deletes server-side, but that seems dangerous :)
> > That's beyond even the notion of stateful scanners, which are tricky
> > as is.
> >
> > MultiDelete would actually process those deletes in parallel,
> > running concurrently across all the servers, so it is a bit more
> > than just List<Delete> under the covers. Or at least that's the
> > intention; I don't think it's built yet.
> >
> > Are you running into performance issues doing the deletes currently,
> > or are you just expecting to run into problems? I would think that
> > if it was taking too long to run from a sequential client, a
> > parallel MultiDelete would solve your problems.
> >
> > JG
> >
> >> -----Original Message-----
> >> From: Juhani Connolly [mailto:juh...@ninja.co.jp]
> >> Sent: Thursday, April 01, 2010 10:44 PM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: Efficient mass deletes
> >>
> >> Having an issue with table design regarding how to delete
> >> old/obsolete data.
> >>
> >> My row names are in a non-time-sorted order, id first followed by
> >> timestamp, the main objective being to run big scans on specific
> >> ids from time x to time y.
> >>
> >> However, this data builds up at a respectable rate and I need a
> >> method to delete old records en masse. I considered using the TTL
> >> parameter on the column families, but the current plan is to
> >> selectively store data for a longer time for specific ids.
> >>
> >> Are there any plans to link a delete operation with a scanner (so
> >> delete range x-y, or, if you supply a filter, delete when
> >> conditions p and q are met)?
> >>
> >> If not, what would be the recommended method to handle these kinds
> >> of batch deletes?
> >> The current JIRA for MultiDelete (
> >> http://issues.apache.org/jira/browse/HBASE-1845 ) simply implements
> >> deleting on a List<Delete>, which still seems limited.
> >>
> >> Is the only way to do this to run a scan, and then build a List
> >> from that to use with the multi call discussed in HBASE-1845? This
> >> feels very inefficient, but please correct me if I'm mistaken. The
> >> current activity estimate is about 10 million rows a day,
> >> generating about 300 million cells, which would need to be deleted
> >> on a regular basis (so 300 million cells every day, or 2.1 billion
> >> once a week).
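
For reference, the plain client-side approach Juhani asks about above
(scan, build up Deletes, ship them in batches) would look roughly like
the sketch below. Again this is against the 0.20.x API and assumes the
batched delete(ArrayList<Delete>) overload on HTable; the table name,
row-key range, and batch size of 1000 are made up, and the batched
delete still proceeds region by region from one client rather than in
parallel:

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

/** Sequential scan-then-batch-delete from a single client. */
public class ScanAndDelete {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    // Scan only the obsolete range; the start/stop keys are
    // placeholders for "id then timestamp" row keys.
    Scan scan = new Scan(Bytes.toBytes("id42/0000000000"),
        Bytes.toBytes("id42/1262304000"));
    scan.setCaching(500);

    ResultScanner scanner = table.getScanner(scan);
    ArrayList<Delete> batch = new ArrayList<Delete>();
    try {
      for (Result r : scanner) {
        batch.add(new Delete(r.getRow()));
        if (batch.size() >= 1000) { // flush in chunks to bound memory
          table.delete(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        table.delete(batch); // flush the final partial batch
      }
    } finally {
      scanner.close();
    }
  }
}

At the volumes quoted above (~300 million cells a day), that
single-client loop is likely to be the bottleneck, which is why the M/R
version, or the parallel MultiDelete JG mentions, is the more
attractive option.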