Currently it is just something I expect to run into problems with, as I am still some way off from load testing, though I hope to get started on it soon. The MultiDelete implementation planned for 0.21 will certainly help a lot, though. Perhaps running an M/R job that takes a scan result as its input and deletes a range in each task could be an efficient way to do these kinds of mass deletes? A rough sketch of what I have in mind is below.
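This is only a sketch against the 0.20 mapreduce API (TableMapper/TableMapReduceUtil), untested; the table name, family, and the time-range cutoff are placeholders for whatever "old" means for the table (with id-then-timestamp row keys you'd bound the scan by key instead):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class MassDeleteJob {

  // Each map task receives one slice of the scan and issues a Delete
  // for every row it sees, so the deletes run in parallel per region.
  static class DeleteMapper extends TableMapper<NullWritable, NullWritable> {
    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
      // "mytable" is a placeholder name.
      table = new HTable(new HBaseConfiguration(context.getConfiguration()), "mytable");
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException {
      table.delete(new Delete(row.get()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "mass-delete");
    job.setJarByClass(MassDeleteJob.class);

    // Restrict the scan to rows that should go away; this cell-timestamp
    // range is just a stand-in for the real selection criteria.
    Scan scan = new Scan();
    scan.setTimeRange(0L, System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000);
    scan.setCaching(500);

    TableMapReduceUtil.initTableMapperJob("mytable", scan, DeleteMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}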

On 04/03/2010 01:26 AM, Jonathan Gray wrote:
Juhani,

Deletes are really special versions of Puts (so they are equally fast).  I
suppose it would be possible to have some kind of special filter that issued
deletes server-side, but that seems dangerous :)  That's beyond even the notion
of stateful scanners, which are tricky as is.
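For illustration: client-side, a Delete is built and sent just like a Put; internally it only writes delete markers (tombstones) that get cleaned up at compaction. The table and column names in this untested snippet are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutVsDelete {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    // A Put writes a cell...
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
    table.put(put);

    // ...and a Delete is issued the same way; it just writes a
    // tombstone rather than a value.
    Delete delete = new Delete(Bytes.toBytes("row1"));
    delete.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"));
    table.delete(delete);
  }
}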

MultiDelete would actually process those deletes in parallel, running concurrently
across all the servers, so it is a bit more than just List<Delete> under the
covers.  Or at least that's the intention; I don't think it's been built yet.
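Until something like that lands, a rough client-side approximation is to shard the Deletes over a thread pool, one HTable per thread since HTable is not thread-safe. An untested sketch with made-up names:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;

public class ParallelDeletes {
  // Fan a list of Deletes out over a small thread pool.  Each worker
  // gets its own HTable because HTable is not safe for concurrent use.
  public static void deleteAll(final List<Delete> deletes, int threads)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    int chunk = (deletes.size() + threads - 1) / threads;
    for (int i = 0; i < deletes.size(); i += chunk) {
      final List<Delete> slice =
          deletes.subList(i, Math.min(i + chunk, deletes.size()));
      pool.submit(new Runnable() {
        public void run() {
          try {
            HTable table = new HTable(new HBaseConfiguration(), "mytable");
            for (Delete d : slice) {
              table.delete(d);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  }
}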

Are you running into performance issues doing the deletes currently, or are you 
just expecting to run into problems?  I would think that if it was taking too 
long to run from a sequential client, a parallel MultiDelete would solve your 
problems.

JG

-----Original Message-----
From: Juhani Connolly [mailto:juh...@ninja.co.jp]
Sent: Thursday, April 01, 2010 10:44 PM
To: hbase-user@hadoop.apache.org
Subject: Efficient mass deletes

Having an issue with table design regarding how to delete old/obsolete data.

I have row names in a non-time-sorted manner, id first followed by timestamp,
the main objective being to run big scans on specific id's from time x to time y.

However, this data builds up at a respectable rate and I need a method to
delete old records en masse. I considered using the TTL parameter on the
column families, but the current plan is to selectively store data for a
longer time for specific id's.
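(For reference, the TTL I mean is the per-family setting on the column descriptor, roughly as below at table creation with placeholder names; since it applies uniformly to the whole family, it can't keep specific id's longer.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TtlExample {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

    HTableDescriptor desc = new HTableDescriptor("mytable");
    HColumnDescriptor family = new HColumnDescriptor("cf");
    // Cells older than seven days are dropped at compaction time,
    // for every row in the family alike.
    family.setTimeToLive(7 * 24 * 60 * 60);
    desc.addFamily(family);
    admin.createTable(desc);
  }
}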

Are there any plans to link a delete operation with a scanner (so delete
range x-y, or, if you supply a filter, delete when conditions p and q are met)?

If not, what would be the recommended method to handle these kinds of
batch deletes?
The current JIRA for MultiDelete (
http://issues.apache.org/jira/browse/HBASE-1845 ) simply implements
deleting on a List<Delete>, which still seems limited.

Is the only way to do this to run a scan, and then build a List from that
to use with the multi call discussed in HBASE-1845? This feels very
inefficient, but please correct me if I'm mistaken. The current activity
estimate is about 10 million rows a day, generating about 300 million cells,
which would need to be deleted on a regular basis (so 300 million cells
every day, or 2.1 billion once a week). A sketch of the scan-then-delete
approach I mean is below.
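To make the question concrete, something like this untested sketch is what I mean; the table name and key bounds are placeholders, and the batched delete assumes the List<Delete>-style call from HBASE-1845:

import java.util.ArrayList;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanThenDelete {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    // Scan the key range to purge; with id-then-timestamp row keys this
    // would be repeated per id.  These bounds are placeholders.
    Scan scan = new Scan(Bytes.toBytes("id1/0000000000"),
                         Bytes.toBytes("id1/1262304000"));
    scan.setCaching(1000);  // pull rows back in large chunks

    ResultScanner scanner = table.getScanner(scan);
    ArrayList<Delete> batch = new ArrayList<Delete>();
    try {
      for (Result row : scanner) {
        batch.add(new Delete(row.getRow()));
        if (batch.size() >= 1000) {
          table.delete(batch);  // batched call per HBASE-1845
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        table.delete(batch);
      }
    } finally {
      scanner.close();
    }
  }
}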
