Currently it's just something I expect to run into problems with, as I'm
still some way from load testing, though I hope to get started on it soon.
The MultiDelete implementation planned for 0.21 will certainly help a lot,
though.
Perhaps running an M/R job that takes a scan result as its input and
deletes a range in each task could be an efficient way to do these kinds of
mass deletes?
On 04/03/2010 01:26 AM, Jonathan Gray wrote:
Juhani,
Deletes are really special versions of Puts (so they are equally fast). I
suppose it would be possible to have some kind of special filter that issued
deletes server-side, but that seems dangerous :) It's beyond even the notion
of stateful scanners, which are tricky as it is.
MultiDelete would actually process those deletes in parallel, running
concurrently across all the servers, so it's a bit more than just a
List<Delete> under the covers. Or at least that's the intention; I don't
think it's been built yet.
Are you running into performance issues doing the deletes currently, or are you
just expecting to run into problems? I would think that if it was taking too
long to run from a sequential client, a parallel MultiDelete would solve your
problems.
JG
-----Original Message-----
From: Juhani Connolly [mailto:juh...@ninja.co.jp]
Sent: Thursday, April 01, 2010 10:44 PM
To: hbase-user@hadoop.apache.org
Subject: Efficient mass deletes
I'm having an issue with table design, specifically how to delete
old/obsolete data.
My row keys are not time-sorted: the id comes first, followed by the
timestamp, the main objective being to run big scans on specific ids from
time x to time y.
However, this data builds up at a respectable rate and I need a method to
delete old records en masse. I considered using the TTL parameter on the
column families, but the current plan is to selectively store data for a
longer time for specific ids.
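(For reference, the TTL route would just be a per-family setting at table
creation time; an untested snippet, with "mytable" and "data" as
placeholder names:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTableWithTtl {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

        HTableDescriptor desc = new HTableDescriptor("mytable");
        HColumnDescriptor family = new HColumnDescriptor("data");
        family.setTimeToLive(30 * 24 * 60 * 60); // expire cells after ~30 days
        desc.addFamily(family);

        admin.createTable(desc);
      }
    }

But a single per-family TTL can't express the longer retention for
specific ids.)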
Are there any plans to link a delete operation to a scanner (so: delete
range x-y, or, if you supply a filter, delete where conditions p and q are
met)?
If not, what would be the recommended method to handle these kinds of
batch deletes?
The current JIRA for MultiDelete
(http://issues.apache.org/jira/browse/HBASE-1845) simply implements
deleting on a List<Delete>, which still seems limited.
Is the only way to do this to run a scan and then build a List from the
results to use with the multi call discussed in HBASE-1845? This feels very
inefficient, but please correct me if I'm mistaken. The current activity
estimate is about 10 million rows a day, generating about 300 million
cells, which would need to be deleted on a regular basis (so 300 million
cells every day, or 2.1 billion once a week).
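For concreteness, the scan-then-delete approach I mean is roughly the
following (untested; "mytable", the <id><timestamp> key layout and the
range bounds are placeholders, and without MultiDelete it falls back to
one delete per row):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SequentialMassDelete {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");

        // Keys are <id><timestamp>, so one id's obsolete slice is a
        // contiguous range of rows.
        Scan scan = new Scan(Bytes.toBytes("id42-0000000000"),
                             Bytes.toBytes("id42-1270000000"));
        scan.setCaching(500); // pull scan results in batches of rows

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // One RPC per row; with HBASE-1845 these could instead be
            // collected into a List<Delete> and sent as one parallel call.
            table.delete(new Delete(row.getRow()));
          }
        } finally {
          scanner.close();
        }
      }
    }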