Thank you so much for all the answers, guys. Looks like I should write up something for the ref guide!
J-D

On Sun, Jun 23, 2013 at 3:31 PM, Andrew Purtell <[email protected]> wrote:
> Right, compaction followed by a "secure" HDFS-level delete by random
> rewrites to nuke the blocks of the remnant. Even then it's difficult to
> say something recoverable does not remain, though in practical terms the
> hypothetical user here could be assured no API of HBase or HDFS could
> ever retrieve the data.
>
> Or burn the platters to ash.
>
> On Sunday, June 23, 2013, Ian Varley wrote:
>
>> One more followup on this, after talking to some security types:
>>
>> - The issue isn't wiping out all data for a customer; it's wiping out
>> *specific* data. Using the "forget an encryption key" method would then
>> mean separate encryption keys per row, which isn't feasible most of the
>> time. (Consider information that becomes classified but didn't use to
>> be, for example.)
>> - In some cases, decryption can still happen without keys, by brute
>> force or from finding weaknesses in the algorithms down the road. Yes,
>> I know that the brute-force CPU time is measured in eons, but never say
>> never; we can easily decrypt things now that were encrypted with the
>> best available algorithms and keys 40 years ago. :)
>>
>> So for cases where it counts, a "secure delete" means no less than
>> writing over the data with random strings. It would be interesting to
>> add features to HBase / HDFS that passed muster for stuff like this;
>> for example, an HDFS secure-delete command
>> (http://www.ghacks.net/2010/08/26/securely-delete-files-with-secure-delete/)
>> and an HBase secure-delete that does all of: add a delete marker, force
>> a major compaction, and run the HDFS secure-delete.
>>
>> Ian
>>
>> On Jun 20, 2013, at 7:39 AM, Jean-Marc Spaggiari wrote:
>>
>> Correct, that's another way. You just need to have one encryption key
>> per customer, and everything that is written into HBase, across all the
>> tables, is encrypted with that key.
>>
>> If the customer wants to have all its data erased, just erase the key,
>> and you have no way to retrieve anything from HBase even if it's still
>> in all the tables. So now you can emit all the deletes required, and
>> that will be totally deleted on the next regular major compaction...
>>
>> There will be a small impact on regular reads/writes since you will
>> need to read the key first, but then a user delete will be way more
>> efficient.
>>
>> 2013/6/20 lars hofhansl <[email protected]>:
>> IMHO the "proper" way of doing such things is encryption.
>>
>> 0-ing the values or even overwriting with a pattern typically leaves
>> traces of the old data on a magnetic platter that can be retrieved with
>> proper forensics. (Secure erase of SSDs is typically pretty secure,
>> though.)
>>
>> For such use cases, the files (HFiles) should be encrypted and the
>> decryption keys should just be forgotten at the appropriate times.
>> I realize that for J-D's specific use case doing this at the HFile
>> level would be very difficult.
>>
>> Maybe the KVs' values could be stored encrypted with a user-specific
>> key. Deleting the user's data then means forgetting that user's key.
>>
>> -- Lars
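A rough sketch of the per-customer-key idea Lars and Jean-Marc describe above (sometimes called crypto-shredding): encrypt each value with that customer's key on the way in, so destroying the key makes whatever is still sitting in the HFiles unreadable. The table, family, qualifier, and row key below are made up, the key store is left out, and the client calls follow the 0.94-era API, so treat this as a sketch rather than a recipe:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CryptoShredWrite {
  public static void main(String[] args) throws Exception {
    // One key per customer, normally held in an external key store (not shown).
    // Forgetting that key renders every value written with it unreadable,
    // even though the ciphertext stays in the HFiles until compaction.
    KeyGenerator keyGen = KeyGenerator.getInstance("AES");
    keyGen.init(128);
    SecretKey customerKey = keyGen.generateKey();

    Cipher cipher = Cipher.getInstance("AES");  // ECB for brevity; use a mode with an IV in practice
    cipher.init(Cipher.ENCRYPT_MODE, customerKey);
    byte[] encrypted = cipher.doFinal(Bytes.toBytes("some sensitive value"));

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");                // hypothetical table
    Put put = new Put(Bytes.toBytes("user1234|2013-06-19"));  // hypothetical row key
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), encrypted);
    table.put(put);
    table.close();
  }
}

Reads do the reverse: fetch the customer's key, then decrypt the cell value client-side, which is the "small impact on regular reads/writes" Jean-Marc mentions.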
>> ________________________________
>> From: Matt Corgan <[email protected]>
>> To: dev <[email protected]>
>> Sent: Wednesday, June 19, 2013 2:15 PM
>> Subject: Re: Efficiently wiping out random data?
>>
>> Would it be possible to zero out all the value bytes for cells in
>> existing HFiles? The keys would remain, but if you knew that ahead of
>> time you could design your keys so they don't contain important info.
>>
>> On Wed, Jun 19, 2013 at 11:28 AM, Ian Varley <[email protected]> wrote:
>>
>> At least in some cases, the answer to that question ("do you even have
>> to destroy your tapes?") is a resounding "yes". For some extreme cases
>> (think health care, privacy, etc.), companies do all RDBMS backups to
>> disk instead of tape for that reason. (Transaction logs are considered
>> different, I guess because they're inherently transient? Who knows.)
>>
>> The "no time travel" fix doesn't work, because you could still change
>> that code or ACL in the future and get back to the data. In these
>> cases, one must provably destroy the data.
>>
>> That said, forcing full compactions (especially if they can be targeted
>> via stripes or levels or something) is an OK way to handle it, maybe
>> eventually with more ways to nice it down so it doesn't hose your
>> cluster.
>>
>> Ian
>>
>> On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:
>>
>> I'd also question what exactly the regulatory requirements for deletion
>> are. For example, if you had tape backups of your Oracle DB, would you
>> have to drive to your off-site storage facility, grab every tape you
>> ever made, and zero out the user's data as well? I doubt it,
>> considering tapes have basically the same storage characteristics as
>> HDFS in terms of inability to random-write.
>>
>> Another example: deletes work the same way in most databases -- e.g. in
>> Postgres, deletion of a record just consists of setting the record's
>> "xmax" column to the current transaction ID. This is equivalent to a
>> tombstone, and you have to wait for a VACUUM process to come along and
>> actually delete the record entry. In Oracle, the record will persist in
>> a rollback segment for a configurable amount of time, and you can use a
>> Flashback query to time-travel and see it again. In Vertica, you also
>> set an "xmax" entry and wait until the next merge-out (like a major
>> compaction).
>>
>> Even in a filesystem, deletion doesn't typically remove data unless you
>> use a tool like srm. It just unlinks the inode from the directory tree.
>>
>> So, if any of the above systems satisfy their use case, then HBase
>> ought to as well. Perhaps there's an ACL we could add which would
>> allow/disallow users from doing time travel more than N seconds in the
>> past... maybe that would help allay fears?
>>
>> -Todd
>>
>> On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <[email protected]> wrote:
>>
>> Chances are that data isn't completely "random". For instance, with a
>> user they are likely to have an id in their row key, so doing a
>> filtering major compaction (with a custom scanner) would clean that up.
>> With Sergey's compaction stuff coming in you could break that out even
>> further and only have to compact a small set of files to get that
>> removal.
>>
>> So it's hard, but as it's not our direct use case, it's gonna be a few
>> extra hoops.
>>
>> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>>
>> Yeah, the immutable nature of HDFS is biting us here.
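Building on Jesse's point that the user's id usually leads the row key: finding and tombstoning every row for one user is just a prefix scan plus a batch of deletes. This only writes delete markers; the bytes stay in the HFiles until the flush-and-compact step shown further below. The table name and key layout are hypothetical and the calls follow the 0.94-era client API:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneUserRows {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");         // hypothetical table
    byte[] userPrefix = Bytes.toBytes("user1234|");    // hypothetical row-key prefix

    Scan scan = new Scan(userPrefix);                  // start at the user's first row
    scan.setFilter(new PrefixFilter(userPrefix));      // only return rows under the prefix

    List<Delete> deletes = new ArrayList<Delete>();
    ResultScanner results = table.getScanner(scan);
    try {
      for (Result r : results) {
        deletes.add(new Delete(r.getRow()));
      }
    } finally {
      results.close();
    }

    table.delete(deletes);  // tombstones only; the data is still on disk
    table.close();
  }
}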
>> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>
>> That sounds like a very effective way for developers to kill clusters
>> with compactions :)
>>
>> J-D
>>
>> On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <[email protected]> wrote:
>>
>> JD,
>>
>> What about adding a flag for the delete, something like -full or -true
>> (it is early)? Once we issue the delete to the proper row/region, we
>> run a flush, then execute a single-region major compaction. That way,
>> if it is a single record, or a subset of data, the impact is minimal.
>> If the delete happens to hit every region we will compact every region
>> (not ideal). Another thought would be an overwrite
>
> --
> Best regards,
>
>   - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
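Kevin's flow above, as a client-side sketch: tombstone the row, flush so the marker reaches an HFile, then request a major compaction of only the region that hosts the row. Names are made up and the admin calls are from the 0.94-era API; note that the compacted-away HFiles are merely unlinked in HDFS, so the "secure" overwrite of the old blocks discussed earlier in the thread would still be a separate step:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteFlushCompact {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");           // hypothetical table
    byte[] row = Bytes.toBytes("user1234|2013-06-19");   // hypothetical row key

    table.delete(new Delete(row));  // ordinary delete: only a tombstone so far

    // Target just the region hosting this row instead of the whole table.
    String region = table.getRegionLocation(row)
        .getRegionInfo().getRegionNameAsString();

    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.flush(region);            // push the tombstone out of the memstore
    admin.majorCompact(region);     // async request to rewrite the region's HFiles
    table.close();
  }
}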
