I believe the current TableOutputFormat supports both Puts and Deletes, so this functionality is already available in an MR context.
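
For example, a map-only job along these lines should do it, scanning the
obsolete range and emitting a Delete per row. This is a rough sketch
against the 0.20.x client API; the table name "mytable", the scan
settings, and the class names are placeholders, and you should verify
that your version of TableOutputFormat actually accepts Deletes before
relying on it:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

/** Map-only job: scan the obsolete range, emit a Delete for each row. */
public class MassDeleteJob {

  static class DeleteMapper
      extends TableMapper<ImmutableBytesWritable, Delete> {
    @Override
    protected void map(ImmutableBytesWritable row, Result result,
        Context context) throws IOException, InterruptedException {
      // Deletes the whole row; narrow this to specific families or
      // qualifiers if you only want to drop part of it.
      context.write(row, new Delete(row.get()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "mass-delete");
    job.setJarByClass(MassDeleteJob.class);

    // Restrict the scan to the rows you want gone, e.g. via start/stop
    // row or a timestamp range.
    Scan scan = new Scan();
    scan.setCaching(500);

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        DeleteMapper.class, ImmutableBytesWritable.class, Delete.class,
        job);
    // Passing a null reducer class just wires up TableOutputFormat so
    // the emitted Deletes are applied back to the table.
    TableMapReduceUtil.initTableReducerJob("mytable", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each map task then handles one region's worth of rows, so the delete
load is spread across the cluster instead of funneling through a single
client.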
> -----Original Message-----
> From: Juhani Connolly [mailto:juh...@ninja.co.jp]
> Sent: Sunday, April 04, 2010 11:14 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Efficient mass deletes
>
> Currently it is just something I expect to run into problems with, as
> I am still some way off from load testing, though I hope to get
> started on it soon. The MultiDelete implementation planned for 0.21
> will certainly help a lot, though.
> Perhaps running an M/R job with a scan result as the input, where each
> task deletes a range, could be an efficient way to do these kinds of
> mass deletes?
>
> On 04/03/2010 01:26 AM, Jonathan Gray wrote:
> > Juhani,
> >
> > Deletes are really special versions of Puts (so they are equally
> > fast). I suppose it would be possible to have some kind of special
> > filter that issued deletes server-side, but that seems dangerous :)
> > That's beyond even the notion of stateful scanners, which are tricky
> > as is.
> >
> > MultiDelete would actually process those deletes in parallel,
> > running concurrently across all the servers, so it is a bit more
> > than just List<Delete> under the covers. Or at least that's the
> > intention; I don't think it's built yet.
> >
> > Are you running into performance issues doing the deletes currently,
> > or are you just expecting to run into problems? I would think that
> > if it was taking too long to run from a sequential client, a
> > parallel MultiDelete would solve your problems.
> >
> > JG
> >
> >> -----Original Message-----
> >> From: Juhani Connolly [mailto:juh...@ninja.co.jp]
> >> Sent: Thursday, April 01, 2010 10:44 PM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: Efficient mass deletes
> >>
> >> Having an issue with table design regarding how to delete
> >> old/obsolete data.
> >>
> >> My row names are in a non-time-sorted order, id first followed by
> >> timestamp, the main objective being to run big scans on specific
> >> ids from time x to time y.
> >>
> >> However, this data builds up at a respectable rate and I need a
> >> method to delete old records en masse. I considered using the TTL
> >> parameter on the column families, but the current plan is to
> >> selectively store data for a longer time for specific ids.
> >>
> >> Are there any plans to link a delete operation with a scanner (so
> >> delete range x-y, or, if you supply a filter, delete when
> >> conditions p and q are met)?
> >>
> >> If not, what would be the recommended method to handle these kinds
> >> of batch deletes?
> >> The current JIRA for MultiDelete (
> >> http://issues.apache.org/jira/browse/HBASE-1845 ) simply implements
> >> deleting on a List<Delete>, which still seems limited.
> >>
> >> Is the only way to do this to run a scan, and then build a List
> >> from that to use with the multi call discussed in HBASE-1845? This
> >> feels very inefficient, but please correct me if I'm mistaken. The
> >> current activity estimate is about 10 million rows a day,
> >> generating about 300 million cells, which would need to be deleted
> >> on a regular basis (so 300 million cells every day, or 2.1 billion
> >> once a week).
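
For reference, the plain client-side approach Juhani asks about above
(scan, build up Deletes, ship them in batches) would look roughly like
the sketch below. Again this is against the 0.20.x API and assumes the
batched delete(ArrayList<Delete>) overload on HTable; the table name,
row-key range, and batch size of 1000 are made up, and the batched
delete still proceeds region by region from one client rather than in
parallel:

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

/** Sequential scan-then-batch-delete from a single client. */
public class ScanAndDelete {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");

    // Scan only the obsolete range; the start/stop keys are
    // placeholders for "id then timestamp" row keys.
    Scan scan = new Scan(Bytes.toBytes("id42/0000000000"),
        Bytes.toBytes("id42/1262304000"));
    scan.setCaching(500);

    ResultScanner scanner = table.getScanner(scan);
    ArrayList<Delete> batch = new ArrayList<Delete>();
    try {
      for (Result r : scanner) {
        batch.add(new Delete(r.getRow()));
        if (batch.size() >= 1000) { // flush in chunks to bound memory
          table.delete(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        table.delete(batch); // flush the final partial batch
      }
    } finally {
      scanner.close();
    }
  }
}

At the volumes quoted above (~300 million cells a day), that
single-client loop is likely to be the bottleneck, which is why the M/R
version, or the parallel MultiDelete JG mentions, is the more
attractive option.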