[ 
https://issues.apache.org/jira/browse/HADOOP-16522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-16522:
------------------------------------
    Parent Issue: HADOOP-17566  (was: HADOOP-16829)

> Encrypt S3A buffered data on disk
> ---------------------------------
>
>                 Key: HADOOP-16522
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16522
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Mike Yoder
>            Priority: Major
>
> This came out of discussions with [[email protected]], [~irashid] and 
> [~vanzin].
> Imran:
> {quote}
> Steve pointed out to me that the s3 libraries buffer data to disk.  This is 
> pretty much arbitrary user data.
>  
> Spark has some settings to encrypt data that it writes to local disk (shuffle 
> files etc.). Spark never has control of what arbitrary libraries are doing 
> with data, so it doesn't guarantee that nothing ever ends up on disk -- but 
> to the end user, those s3 libraries are part of the same system. So if a 
> user turns on Spark's local-disk encryption, they would be pretty surprised 
> to find out that the data they're writing to S3 ends up on local disk, 
> unencrypted.
> {quote}
> Me:
> {quote}
> ... Regardless, this is still an s3a bug.
> {quote}
>  
> Steve:
> {quote}
> I disagree.
> We need to save intermediate data "somewhere"; people get a choice of disk 
> or memory.
> Encrypting data on disk was never considered necessary, on the basis that 
> anyone malicious with read access under your home dir could lift the hadoop 
> token file which YARN provides, and so have full R/W access to all your data 
> in the cluster filesystems until those tokens expire. If you don't have a 
> good story there, then the buffering of a few tens of MB of data during 
> upload is a detail.
> There's also the extra complication that when uploading file blocks, we pass 
> the filename to the AWS SDK and let it do the uploads, rather than create 
> the output stream ourselves; the SDK code has, in the past, been better at 
> recovering from failures there than output stream + mark and reset. That was 
> a while back; things may have changed. But it is why I'd prefer any 
> encrypted temp store to be a new buffer option, rather than silently 
> changing the "disk" buffer option to encrypt.
> It would be interesting to see where else in the code this needs to be 
> addressed; I'd recommend looking at all uses of 
> org.apache.hadoop.fs.LocalDirAllocator and making sure that Spark YARN 
> launch+execute doesn't use it indirectly.
> JIRAs under HADOOP-15620 are welcome; do look at the test policy in the 
> hadoop-aws docs. We'd need a new subclass of AbstractSTestS3AHugeFiles for 
> integration testing of a different buffering option, plus whatever unit 
> tests the encryption itself needs.
> {quote}
> Me:
> {quote}
> I get it. But there are a couple of subtleties here. One is that the tokens 
> expire, while the data is still data. (This might or might not matter, 
> depending on the threat...) Another is that customer policies in this area 
> do not always align well with common sense. There are blanket policies like 
> "data shall never be written to disk unencrypted", which we have come up 
> against and which we'd like to be able to honestly answer in the 
> affirmative. We have encrypted MR shuffle as one historical example, and 
> encrypted Impala memory spills as another.
> {quote}
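As a rough illustration of the "new buffer option" Steve describes (rather than silently changing the existing "disk" option), here is a minimal sketch of how an encrypted local-disk buffer could work: a per-file AES key generated in memory, plaintext encrypted with AES/CTR on the way to disk, and decrypted on read-back. Everything here is hypothetical, not actual S3A code; the class name `EncryptedDiskBuffer` and any corresponding option name are assumptions for illustration only.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.*;
import java.nio.file.Files;
import java.security.SecureRandom;

/**
 * Hypothetical sketch (not the actual S3A implementation): an on-disk
 * upload buffer whose contents are AES/CTR-encrypted with a per-file
 * key that exists only in this process's memory.
 */
public class EncryptedDiskBuffer implements Closeable {
    private final File file;
    private final SecretKey key;            // never written to disk
    private final byte[] iv = new byte[16]; // CTR counter block

    public EncryptedDiskBuffer(File dir) throws Exception {
        this.file = File.createTempFile("s3ablock", ".bin", dir);
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        this.key = kg.generateKey();
        new SecureRandom().nextBytes(iv);
    }

    /** The (encrypted) backing file; handing this name to the SDK
     *  would upload ciphertext, hence the caveat below. */
    public File getFile() {
        return file;
    }

    /** Callers write plaintext; bytes reach disk encrypted. */
    public OutputStream openForWrite() throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherOutputStream(new FileOutputStream(file), c);
    }

    /** Upload code reads decrypted plaintext back through this stream. */
    public InputStream openForRead() throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherInputStream(new FileInputStream(file), c);
    }

    @Override
    public void close() throws IOException {
        // Delete the ciphertext; the key is simply garbage-collected.
        Files.deleteIfExists(file.toPath());
    }
}
```

Note the interaction with Steve's point about the SDK: today the disk buffer hands the raw filename to the AWS SDK, which would see only ciphertext under this scheme. An encrypted buffer option would therefore have to feed the upload a decrypting InputStream instead, giving up whatever failure-recovery advantage the filename-based path has.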



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
