Chances are that data isn't completely "random". For instance, a user is likely to have an id in their row key, so a filtering major compaction (with a custom scanner, sketched below) would clean that up. With Sergey's compaction work coming in you could break that out even further and only have to compact a small set of files to get that removal.
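Very roughly, such a filtering compaction could be done as a coprocessor along these lines. This is a minimal sketch, not a tested implementation: it assumes the 0.94-era RegionObserver/InternalScanner API (hook signatures vary between HBase versions), and WipeUserObserver plus the DELETED_USER_PREFIX row-key prefix are made-up names for illustration:

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.Store;
import org.apache.hadoop.hbase.util.Bytes;

public class WipeUserObserver extends BaseRegionObserver {

  // Made-up row key prefix of the user whose data has to go.
  private static final byte[] DELETED_USER_PREFIX = Bytes.toBytes("user12345|");

  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> c,
      Store store, final InternalScanner scanner) {
    // Wrap the compaction scanner so the doomed user's KeyValues are
    // never written into the compacted HFile.
    return new InternalScanner() {
      private void dropDoomedUser(List<KeyValue> results) {
        for (Iterator<KeyValue> it = results.iterator(); it.hasNext();) {
          if (Bytes.startsWith(it.next().getRow(), DELETED_USER_PREFIX)) {
            it.remove();
          }
        }
      }

      public boolean next(List<KeyValue> results) throws IOException {
        boolean more = scanner.next(results);
        dropDoomedUser(results);
        return more;
      }

      public boolean next(List<KeyValue> results, int limit) throws IOException {
        boolean more = scanner.next(results, limit);
        dropDoomedUser(results);
        return more;
      }

      public void close() throws IOException {
        scanner.close();
      }
    };
  }
}

The upside of hooking preCompact is that the filtering rides along with a compaction the region would do anyway; the downside is you still pay for rewriting every store file the compaction touches.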
So it's hard, but as it's not our direct use case, it's gonna take a few extra hoops.

On Wednesday, June 19, 2013, Kevin O'dell wrote:

> Yeah, the immutable nature of HDFS is biting us here.
>
>
> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <[email protected]>
> wrote:
>
> > That sounds like a very effective way for developers to kill clusters
> > with compactions :)
> >
> > J-D
> >
> > On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <[email protected]>
> > wrote:
> > > JD,
> > >
> > > What about adding a flag for the delete, something like -full or
> > > -true (it is early). Once we issue the delete to the proper row/region
> > > we run a flush, then execute a single-region major compaction. That
> > > way, if it is a single record, or a subset of data, the impact is
> > > minimal. If the delete happens to hit every region we will compact
> > > every region (not ideal). Another thought would be an overwrite, but
> > > with versions this logic becomes more complicated.
> > >
> > >
> > > On Wed, Jun 19, 2013 at 8:31 AM, Jean-Daniel Cryans <[email protected]>
> > > wrote:
> > >
> > >> Hey devs,
> > >>
> > >> I was presenting at GOTO Amsterdam yesterday and I got a question
> > >> about a scenario that I've never thought about before. I'm wondering
> > >> what others think.
> > >>
> > >> How do you efficiently wipe out random data in HBase?
> > >>
> > >> For example, you have a website and a user asks you to close their
> > >> account and get rid of the data.
> > >>
> > >> Would you say "sure can do, lemme just issue a couple of Deletes!" and
> > >> call it a day? What if you really have to delete the data, not just
> > >> mask it, because of contractual obligations or local laws?
> > >>
> > >> Major compacting is the obvious solution but it seems really
> > >> inefficient. Let's say you've got some truly random data to delete,
> > >> and it so happens that you have at least one row per region to get
> > >> rid of... then you need to basically rewrite the whole table?
> > >>
> > >> That was pretty much my answer: I told the attendee that it's not an
> > >> easy use case to manage in HBase.
> > >>
> > >> Thoughts?
> > >>
> > >> J-D
> > >>
> > >
> > >
> > >
> > > --
> > > Kevin O'Dell
> > > Systems Engineer, Cloudera
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera


--
-------------------
Jesse Yates
@jesse_yates
jyates.github.com
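For reference, Kevin's delete-then-flush-then-single-region-major-compaction sequence from the thread above might look roughly like this against the 0.94-era client API. The table name and row key are made up, error handling is omitted, and note that majorCompact() only queues the compaction asynchronously:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class HardDelete {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    byte[] row = Bytes.toBytes("user12345");   // made-up row key

    HTable table = new HTable(conf, "users");  // made-up table name
    table.delete(new Delete(row));             // 1. write the tombstone

    // 2. find the one region holding the row so only its files get rewritten
    String region = table.getRegionLocation(row).getRegionInfo()
        .getRegionNameAsString();

    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.flush(region);        // 3. persist the tombstone into an HFile
    admin.majorCompact(region); // 4. async: rewrite the region's HFiles,
                                //    physically dropping the deleted cells
    admin.close();
    table.close();
  }
}

As the thread notes, this stays cheap only while the deletes are confined to a few regions; once every region is hit you are back to rewriting the whole table.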
