Thank you so much for all the answers, guys. Looks like I should write up something for the ref guide!
J-D

On Sun, Jun 23, 2013 at 3:31 PM, Andrew Purtell <[email protected]> wrote:
> Right, compaction followed by a "secure" HDFS-level delete by random
> rewrites to nuke the blocks of the remnant. Even then it's difficult to
> say something recoverable does not remain, though in practical terms the
> hypothetical user here could be assured no API of HBase or HDFS could
> ever retrieve the data.
>
> Or burn the platters to ash.
>
> On Sunday, June 23, 2013, Ian Varley wrote:
>
>> One more followup on this, after talking to some security types:
>>
>> - The issue isn't wiping out all data for a customer; it's wiping out
>> *specific* data. Using the "forget an encryption key" method would then
>> mean separate encryption keys per row, which isn't feasible most of the
>> time. (Consider information that becomes classified but didn't use to
>> be, for example.)
>> - In some cases, decryption can still happen without keys, by brute
>> force or from finding weaknesses in the algorithms down the road. Yes,
>> I know that the brute-force CPU time is measured in eons, but never say
>> never; we can easily decrypt things now that were encrypted with the
>> best available algorithms and keys 40 years ago. :)
>>
>> So for cases where it counts, a "secure delete" means no less than
>> writing over the data with random strings. It would be interesting to
>> add features to HBase / HDFS that passed muster for stuff like this;
>> for example, an HDFS secure-delete command
>> (http://www.ghacks.net/2010/08/26/securely-delete-files-with-secure-delete/)
>> and an HBase secure-delete that does all of: add a delete marker, force
>> a major compaction, and run the HDFS secure-delete.
>>
>> Ian
>>
>> On Jun 20, 2013, at 7:39 AM, Jean-Marc Spaggiari wrote:
>>
>> Correct, that's another way. You just need to have one encryption key
>> per customer, and everything that is written into HBase, across all the
>> tables, is encrypted with that key.
>>
>> If the customer wants to have all its data erased, just erase the key,
>> and you have no way to retrieve anything from HBase even if it's still
>> in all the tables. So now you can emit all the deletes required, and
>> that will be totally deleted on the next regular major compaction...
>>
>> There will be a small impact on regular reads/writes since you will
>> need to read the key first, but then a user delete will be way more
>> efficient.
>>
>> 2013/6/20 lars hofhansl <[email protected]>:
>> IMHO the "proper" way of doing such things is encryption.
>>
>> 0-ing the values or even overwriting with a pattern typically leaves
>> traces of the old data on a magnetic platter that can be retrieved with
>> proper forensics. (Secure erase of SSDs is typically pretty secure,
>> though.)
>>
>> For such use cases, the files (HFiles) should be encrypted and the
>> decryption keys should just be forgotten at the appropriate times.
>> I realize that for J-D's specific use case doing this at the HFile
>> level would be very difficult.
>>
>> Maybe the KVs' values could be stored encrypted with a user-specific
>> key. Deleting the user's data then means forgetting that user's key.
>>
>> -- Lars
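A rough sketch of the per-customer-key idea Lars and Jean-Marc describe above (sometimes called crypto-shredding): encrypt each value with that customer's key on the way in, so destroying the key makes whatever is still sitting in the HFiles unreadable. The table, family, qualifier, and row key below are made up, the key store is left out, and the client calls follow the 0.94-era API, so treat this as a sketch rather than a recipe:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CryptoShredWrite {
  public static void main(String[] args) throws Exception {
    // One key per customer, normally held in an external key store (not shown).
    // Forgetting that key renders every value written with it unreadable,
    // even though the ciphertext stays in the HFiles until compaction.
    KeyGenerator keyGen = KeyGenerator.getInstance("AES");
    keyGen.init(128);
    SecretKey customerKey = keyGen.generateKey();

    Cipher cipher = Cipher.getInstance("AES");  // ECB for brevity; use a mode with an IV in practice
    cipher.init(Cipher.ENCRYPT_MODE, customerKey);
    byte[] encrypted = cipher.doFinal(Bytes.toBytes("some sensitive value"));

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");                // hypothetical table
    Put put = new Put(Bytes.toBytes("user1234|2013-06-19"));  // hypothetical row key
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), encrypted);
    table.put(put);
    table.close();
  }
}

Reads do the reverse: fetch the customer's key, then decrypt the cell value client-side, which is the "small impact on regular reads/writes" Jean-Marc mentions.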
>> ________________________________
>> From: Matt Corgan <[email protected]>
>> To: dev <[email protected]>
>> Sent: Wednesday, June 19, 2013 2:15 PM
>> Subject: Re: Efficiently wiping out random data?
>>
>> Would it be possible to zero out all the value bytes for cells in
>> existing HFiles? The keys would remain, but if you knew that ahead of
>> time you could design your keys so they don't contain important info.
>>
>> On Wed, Jun 19, 2013 at 11:28 AM, Ian Varley <[email protected]> wrote:
>>
>> At least in some cases, the answer to that question ("do you even have
>> to destroy your tapes?") is a resounding "yes". For some extreme cases
>> (think health care, privacy, etc.), companies do all RDBMS backups to
>> disk instead of tape for that reason. (Transaction logs are considered
>> different, I guess because they're inherently transient? Who knows.)
>>
>> The "no time travel" fix doesn't work, because you could still change
>> that code or ACL in the future and get back to the data. In these
>> cases, one must provably destroy the data.
>>
>> That said, forcing full compactions (especially if they can be targeted
>> via stripes or levels or something) is an OK way to handle it, maybe
>> eventually with more ways to nice it down so it doesn't hose your
>> cluster.
>>
>> Ian
>>
>> On Jun 19, 2013, at 11:27 AM, Todd Lipcon wrote:
>>
>> I'd also question what exactly the regulatory requirements for deletion
>> are. For example, if you had tape backups of your Oracle DB, would you
>> have to drive to your off-site storage facility, grab every tape you
>> ever made, and zero out the user's data as well? I doubt it,
>> considering tapes have basically the same storage characteristics as
>> HDFS in terms of inability to random-write.
>>
>> Another example: deletes work the same way in most databases -- e.g. in
>> Postgres, deletion of a record just consists of setting the record's
>> "xmax" column to the current transaction ID. This is equivalent to a
>> tombstone, and you have to wait for a VACUUM process to come along and
>> actually delete the record entry. In Oracle, the record will persist in
>> a rollback segment for a configurable amount of time, and you can use a
>> Flashback query to time-travel and see it again. In Vertica, you also
>> set an "xmax" entry and wait until the next merge-out (like a major
>> compaction).
>>
>> Even in a filesystem, deletion doesn't typically remove data unless you
>> use a tool like srm. It just unlinks the inode from the directory tree.
>>
>> So, if any of the above systems satisfy their use case, then HBase
>> ought to as well. Perhaps there's an ACL we could add which would
>> allow/disallow users from doing time travel more than N seconds in the
>> past... maybe that would help allay fears?
>>
>> -Todd
>>
>> On Wed, Jun 19, 2013 at 8:12 AM, Jesse Yates <[email protected]> wrote:
>>
>> Chances are that data isn't completely "random". For instance, with a
>> user they are likely to have an id in their row key, so doing a
>> filtering major compaction (with a custom scanner) would clean that up.
>> With Sergey's compaction stuff coming in you could break that out even
>> further and only have to compact a small set of files to get that
>> removal.
>>
>> So it's hard, but as it's not our direct use case, it's gonna be a few
>> extra hoops.
>>
>> On Wednesday, June 19, 2013, Kevin O'dell wrote:
>>
>> Yeah, the immutable nature of HDFS is biting us here.
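Building on Jesse's point that the user's id usually leads the row key: finding and tombstoning every row for one user is just a prefix scan plus a batch of deletes. This only writes delete markers; the bytes stay in the HFiles until the flush-and-compact step shown further below. The table name and key layout are hypothetical and the calls follow the 0.94-era client API:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneUserRows {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");         // hypothetical table
    byte[] userPrefix = Bytes.toBytes("user1234|");    // hypothetical row-key prefix

    Scan scan = new Scan(userPrefix);                  // start at the user's first row
    scan.setFilter(new PrefixFilter(userPrefix));      // only return rows under the prefix

    List<Delete> deletes = new ArrayList<Delete>();
    ResultScanner results = table.getScanner(scan);
    try {
      for (Result r : results) {
        deletes.add(new Delete(r.getRow()));
      }
    } finally {
      results.close();
    }

    table.delete(deletes);  // tombstones only; the data is still on disk
    table.close();
  }
}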
>> On Wed, Jun 19, 2013 at 8:46 AM, Jean-Daniel Cryans <[email protected]> wrote:
>>
>> That sounds like a very effective way for developers to kill clusters
>> with compactions :)
>>
>> J-D
>>
>> On Wed, Jun 19, 2013 at 2:39 PM, Kevin O'dell <[email protected]> wrote:
>>
>> JD,
>>
>> What about adding a flag for the delete, something like -full or -true
>> (it is early)? Once we issue the delete to the proper row/region, we
>> run a flush, then execute a single-region major compaction. That way,
>> if it is a single record, or a subset of data, the impact is minimal.
>> If the delete happens to hit every region we will compact every region
>> (not ideal). Another thought would be an overwrite
>
> --
> Best regards,
>
>   - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
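Kevin's flow above, as a client-side sketch: tombstone the row, flush so the marker reaches an HFile, then request a major compaction of only the region that hosts the row. Names are made up and the admin calls are from the 0.94-era API; note that the compacted-away HFiles are merely unlinked in HDFS, so the "secure" overwrite of the old blocks discussed earlier in the thread would still be a separate step:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteFlushCompact {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");           // hypothetical table
    byte[] row = Bytes.toBytes("user1234|2013-06-19");   // hypothetical row key

    table.delete(new Delete(row));  // ordinary delete: only a tombstone so far

    // Target just the region hosting this row instead of the whole table.
    String region = table.getRegionLocation(row)
        .getRegionInfo().getRegionNameAsString();

    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.flush(region);            // push the tombstone out of the memstore
    admin.majorCompact(region);     // async request to rewrite the region's HFiles
    table.close();
  }
}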
