Currently it's just something I expect to run into problems with, as I'm
still some way from load testing, though I hope to get started on it soon.
The MultiDelete implementation planned for 0.21 will certainly help a lot,
though.
Perhaps running an M/R job that takes a scan result as its input and
deletes a range in each task could be an efficient way to do these kinds of
mass deletes?
On 04/03/2010 01:26 AM, Jonathan Gray wrote:
Juhani,
Deletes are really special versions of Puts (so they are equally fast). I
suppose it would be possible to have some kind of special filter that issued
deletes server-side, but that seems dangerous :) It's beyond even the notion
of stateful scanners, which are tricky as it is.
MultiDelete would actually process those deletes in parallel, running
concurrently across all the servers, so it's a bit more than just a
List<Delete> under the covers. Or at least that's the intention; I don't
think it's been built yet.
Are you running into performance issues doing the deletes currently, or are you
just expecting to run into problems? I would think that if it was taking too
long to run from a sequential client, a parallel MultiDelete would solve your
problems.
JG
-----Original Message-----
From: Juhani Connolly [mailto:juh...@ninja.co.jp]
Sent: Thursday, April 01, 2010 10:44 PM
To: hbase-user@hadoop.apache.org
Subject: Efficient mass deletes
I'm having an issue with table design, specifically how to delete
old/obsolete data.
My row keys are not time-sorted: the id comes first, followed by the
timestamp, the main objective being to run big scans on specific ids from
time x to time y.
However, this data builds up at a respectable rate and I need a method to
delete old records en masse. I considered using the TTL parameter on the
column families, but the current plan is to selectively store data for a
longer time for specific ids.
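(For reference, the TTL route would just be a per-family setting at table
creation time; an untested snippet, with "mytable" and "data" as
placeholder names:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTableWithTtl {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

        HTableDescriptor desc = new HTableDescriptor("mytable");
        HColumnDescriptor family = new HColumnDescriptor("data");
        family.setTimeToLive(30 * 24 * 60 * 60); // expire cells after ~30 days
        desc.addFamily(family);

        admin.createTable(desc);
      }
    }

But a single per-family TTL can't express the longer retention for
specific ids.)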
Are there any plans to link a delete operation to a scanner (so: delete
range x-y, or, if you supply a filter, delete where conditions p and q are
met)?
If not, what would be the recommended method to handle these kinds of
batch deletes?
The current JIRA for MultiDelete
(http://issues.apache.org/jira/browse/HBASE-1845) simply implements
deleting on a List<Delete>, which still seems limited.
Is the only way to do this to run a scan and then build a List from the
results to use with the multi call discussed in HBASE-1845? This feels very
inefficient, but please correct me if I'm mistaken. The current activity
estimate is about 10 million rows a day, generating about 300 million
cells, which would need to be deleted on a regular basis (so 300 million
cells every day, or 2.1 billion once a week).
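For concreteness, the scan-then-delete approach I mean is roughly the
following (untested; "mytable", the <id><timestamp> key layout and the
range bounds are placeholders, and without MultiDelete it falls back to
one delete per row):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SequentialMassDelete {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");

        // Keys are <id><timestamp>, so one id's obsolete slice is a
        // contiguous range of rows.
        Scan scan = new Scan(Bytes.toBytes("id42-0000000000"),
                             Bytes.toBytes("id42-1270000000"));
        scan.setCaching(500); // pull scan results in batches of rows

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // One RPC per row; with HBASE-1845 these could instead be
            // collected into a List<Delete> and sent as one parallel call.
            table.delete(new Delete(row.getRow()));
          }
        } finally {
          scanner.close();
        }
      }
    }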