Mike Yoder created HADOOP-16522:
-----------------------------------

             Summary: Encrypt buffered data on disk
                 Key: HADOOP-16522
                 URL: https://issues.apache.org/jira/browse/HADOOP-16522
             Project: Hadoop Common
          Issue Type: Sub-task
            Reporter: Mike Yoder


This came out of discussions with [~ste...@apache.org], [~irashid] and 
[~vanzin].

Imran:
{quote}
Steve pointed out to me that the s3 libraries buffer data to disk.  This is 
pretty much arbitrary user data.
 
Spark has some settings to encrypt data that it writes to local disk (shuffle 
files etc.).  Spark never has control of what arbitrary libraries are doing 
with data, so it doesn't guarantee that nothing ever ends up on disk -- but end 
users would view those s3 libraries as part of the same system.  So a user who 
turns on Spark's local-disk encryption would be pretty surprised to find out 
that the data they're writing to S3 ends up on local disk, unencrypted.
{quote}
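
For context, the buffering Imran describes is governed by s3a's upload settings; here is a minimal sketch of the existing knobs (the property names are real; note that none of the current values encrypt anything):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class S3ABufferConfigDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Where blocks are staged during upload: "disk" (the default)
    // spills to local disk; "array" and "bytebuffer" stay in memory.
    conf.set("fs.s3a.fast.upload.buffer", "disk");
    // Comma-separated list of local directories used for the spill files.
    conf.set("fs.s3a.buffer.dir", "/tmp/hadoop-s3a");
  }
}
{code}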

Me:
{quote}
... Regardless, this is still an s3a bug.
{quote}
 
Steve:
{quote}
I disagree.

We need to save intermediate data "somewhere"; people get a choice of disk or 
memory.

Encrypting data on disk was never considered necessary, on the basis that 
anyone malicious with read access under your home dir could lift the Hadoop 
token file that YARN provides, and so have full R/W access to all your data in 
the cluster filesystems until those tokens expire. If you don't have a good 
story there, then the buffering of a few tens of MB of data during upload is a 
detail.

There's also the extra complication that when uploading file blocks, we pass 
the filename to the AWS SDK and let it do the uploads, rather than creating the 
output stream ourselves; the SDK code has, in the past, been better at 
recovering from failures there than an output stream plus mark-and-reset. That 
was a while back; things may have changed. But it is why I'd prefer any 
encrypted temp store to be a new buffer option, rather than silently changing 
the "disk" buffer option to encrypt.

It would be interesting to see where else in the code this needs to be 
addressed; I'd recommend looking at all uses of 
org.apache.hadoop.fs.LocalDirAllocator and making sure that Spark's YARN 
launch+execute doesn't use this indirectly.

JIRAs under HADOOP-15620 are welcome; do look at the test policy in the 
hadoop-aws docs. We'd need a new subclass of AbstractSTestS3AHugeFiles for 
integration-testing a different buffering option, plus whatever unit tests the 
encryption itself needs.
{quote}
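
To make the LocalDirAllocator audit concrete, this is the allocation pattern to search for; a sketch of roughly how s3a obtains its disk buffer files (not the exact S3AFileSystem code):

{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.LocalDirAllocator;

public class BufferDirDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // The allocator round-robins across the directories named by the
    // given config key, skipping full or unwritable volumes.
    LocalDirAllocator allocator = new LocalDirAllocator("fs.s3a.buffer.dir");
    // Any temp file created this way lands on local disk in plaintext --
    // these call sites are what an encryption option would have to cover.
    File block =
        allocator.createTmpFileForWrite("s3ablock-0001-", 8 * 1024 * 1024, conf);
    System.out.println("buffering to " + block.getAbsolutePath());
    block.deleteOnExit();
  }
}
{code}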

Me:
{quote}
I get it. But ... there are a couple of subtleties here. One is that the tokens 
expire, while the data is still data. (This might or might not matter, 
depending on the threat...) Another is that customer policies in this area do 
not always align well with common sense. There are blanket policies like "data 
shall never be written to disk unencrypted", which we have come up against, and 
which we'd like to be able to answer honestly in the affirmative. We have 
encrypted MR shuffle as one historical example, and encrypted Impala memory 
spills as another.
{quote}
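
To sketch what a new encrypted buffer option could look like (everything here is hypothetical: the class, the per-block key handling, and the option name are my assumptions, not existing s3a code): each block gets a random AES key held only in memory, writes go through a CipherOutputStream, and the upload must stream-decrypt instead of handing the AWS SDK a raw filename -- which is exactly the recovery trade-off Steve describes.

{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch of an encrypted on-disk buffer for one upload block. */
public class EncryptedBlockBuffer {
  private static final SecureRandom RNG = new SecureRandom();

  private final byte[] key = new byte[16];  // per-block key, memory only
  private final byte[] iv = new byte[16];
  private final File file;

  public EncryptedBlockBuffer(File file) {
    this.file = file;
    RNG.nextBytes(key);
    RNG.nextBytes(iv);
  }

  /** Stream that encrypts block data before it reaches local disk. */
  public OutputStream openForWrite() throws Exception {
    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.ENCRYPT_MODE,
        new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return new CipherOutputStream(new FileOutputStream(file), c);
  }

  /**
   * Decrypting read-back for the upload. Note the consequence Steve points
   * out: we can no longer hand the AWS SDK a plain filename, so the PUT has
   * to go through a stream, with whatever retry handling that implies.
   */
  public InputStream openForUpload() throws Exception {
    Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
    c.init(Cipher.DECRYPT_MODE,
        new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return new CipherInputStream(new FileInputStream(file), c);
  }
}
{code}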
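
On the testing side, per the hadoop-aws test policy Steve cites, the integration test would mirror the existing disk-blocks scale test; a sketch, where the "disk-encrypted" value is the assumed new option from above, not an existing constant:

{code:java}
import org.apache.hadoop.fs.s3a.scale.AbstractSTestS3AHugeFiles;

/**
 * Hypothetical huge-files scale test for an encrypted disk buffer,
 * following the pattern of the existing disk-blocks subclass.
 */
public class ITestS3AHugeFilesEncryptedDiskBlocks
    extends AbstractSTestS3AHugeFiles {

  @Override
  protected String getBlockOutputBufferName() {
    return "disk-encrypted";  // hypothetical fs.s3a.fast.upload.buffer value
  }
}
{code}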



