Juhani, Deletes are really just special versions of Puts (so they are equally fast): a delete writes a tombstone marker rather than removing anything in place. I suppose it would be possible to have some kind of special filter that issued deletes server-side, but that seems dangerous :) That's beyond even the notion of stateful scanners, which are tricky as is.
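For illustration, here's roughly what that looks like from the client side: a Delete goes down the same write path as a Put and just records tombstones. This is a fragment against the 0.20-era client API; the table and column names are made up, and the classes live in org.apache.hadoop.hbase.client and org.apache.hadoop.hbase.util:

  HTable table = new HTable("mytable"); // picks up the default HBaseConfiguration
  // Whole-row delete: writes tombstones, same write path and cost as a Put.
  table.delete(new Delete(Bytes.toBytes("row-1")));
  // Narrower delete: tombstone a single column rather than the whole row.
  Delete d = new Delete(Bytes.toBytes("row-2"));
  d.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
  table.delete(d);

The actual removal only happens later, at major compaction, which is why deletes stay cheap at write time.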
MultiDelete would actually process those deletes in parallel across all the servers, so it is a bit more than just a List<Delete> under the covers. Or at least that's the intention; I don't think it's built yet. Are you running into performance issues doing the deletes currently, or are you just expecting to run into problems? I would think that if it was taking too long to run from a sequential client, a parallel MultiDelete would solve your problems. There's a rough sketch of the scan-then-batch pattern below, after your quoted message.

JG

> -----Original Message-----
> From: Juhani Connolly [mailto:juh...@ninja.co.jp]
> Sent: Thursday, April 01, 2010 10:44 PM
> To: hbase-user@hadoop.apache.org
> Subject: Efficient mass deletes
>
> Having an issue with table design regarding how to delete old/obsolete
> data.
>
> I have row names in a non-time-sorted manner, id first followed by
> timestamp, the main objective being running big scans on specific ids
> from time x to time y.
>
> However, this data builds up at a respectable rate, and I need a method
> to delete old records en masse. I considered using the ttl parameter on
> the column families, but the current plan is to selectively store data
> for a longer time for specific ids.
>
> Are there any plans to link a delete operation with a scanner (so
> delete range x-y, or, if you supply a filter, delete when conditions p
> and q are met)?
>
> If not, what would be the recommended method to handle these kinds of
> batch deletes? The current JIRA for MultiDelete
> ( http://issues.apache.org/jira/browse/HBASE-1845 ) simply implements
> deleting on a List<Delete>, which still seems limited.
>
> Is the only way to do this to run a scan, and then build a List from
> that to use with the multi call discussed in HBASE-1845? This feels
> very inefficient, but please correct me if I'm mistaken. The current
> activity estimate is about 10 million rows a day, generating about
> 300 million cells, which would need to be deleted on a regular basis
> (so 300 million cells every day, or 2.1 billion once a week).
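For what it's worth, the scan-then-batch pattern would look something like the fragment below. This is only a sketch against the 0.20-era client API; the table name, key range, and batch size are made up, and the table.delete(List<Delete>) call assumes the HBASE-1845 patch is applied (swap in a per-row table.delete(d) otherwise):

  HTable table = new HTable("mytable"); // default HBaseConfiguration
  // Row keys are id-then-timestamp, so one id's time range is contiguous.
  Scan scan = new Scan(Bytes.toBytes("id1/1267401600"),
                       Bytes.toBytes("id1/1270080000"));
  scan.setCaching(1000); // fetch rows in big chunks to cut RPC round trips
  ResultScanner results = table.getScanner(scan);
  try {
    List<Delete> batch = new ArrayList<Delete>();
    for (Result r : results) {
      batch.add(new Delete(r.getRow()));
      if (batch.size() >= 1000) {
        table.delete(batch); // sequential today; parallel once MultiDelete is built
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.delete(batch);
    }
  } finally {
    results.close();
  }

If you need the conditional variant, you can hang a Filter off the Scan and only queue Deletes for the rows that come back, but the deleting itself still happens client-side.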