[
https://issues.apache.org/jira/browse/HADOOP-16522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran updated HADOOP-16522:
------------------------------------
Parent Issue: HADOOP-17566 (was: HADOOP-16829)
> Encrypt S3A buffered data on disk
> ---------------------------------
>
> Key: HADOOP-16522
> URL: https://issues.apache.org/jira/browse/HADOOP-16522
> Project: Hadoop Common
> Issue Type: Sub-task
> Reporter: Mike Yoder
> Priority: Major
>
> This came out of discussions with [[email protected]], [~irashid] and
> [~vanzin].
> Imran:
> {quote}
> Steve pointed out to me that the s3 libraries buffer data to disk. This is
> pretty much arbitrary user data.
>
> Spark has some settings to encrypt data that it writes to local disk (shuffle
> files etc.). Spark never has control of what arbitrary libraries are doing
> with data, so it doesn't guarantee that nothing ever ends up on disk -- but
> to the end user, they'd view those s3 libraries as part of the same system.
> So if a user is turning on spark's local-disk encryption, they would be
> pretty surprised to find out that the data they're writing to S3 ends up on
> local disk, unencrypted.
> {quote}
> Me:
> {quote}
> ... Regardless, this is still an s3a bug.
> {quote}
>
> Steve:
> {quote}
> I disagree.
> We need to save intermediate data "somewhere"; people get a choice of disk
> or memory.
> Encrypting data on disk was never considered necessary, on the basis that
> anyone malicious with read access under your home dir could lift the Hadoop
> token file which YARN provides, and so have full R/W access to all your data
> in the cluster filesystems until those tokens expire. If you don't have a
> good story there, then the buffering of a few tens of MB of data during
> upload is a detail.
> There's also the extra complication that when uploading file blocks, we pass
> in the filename to the AWS SDK and let it do the uploads, rather than
> creating the output stream ourselves; the SDK code has, in the past, been
> better at recovering from failures there than an output stream with mark and
> reset. That was a while back, and things may have changed. But it is why I'd
> prefer any encrypted temp store to be a new buffer option, rather than
> silently changing the "disk" buffer option to encrypt.
> It would be interesting to see where else in the code this needs to be
> addressed; I'd recommend looking at all uses of
> org.apache.hadoop.fs.LocalDirAllocator and making sure that the Spark-on-YARN
> launch and execute path doesn't use it indirectly.
> JIRAs under HADOOP-15620 are welcome; do look at the test policy in the
> hadoop-aws docs. We'd need a new subclass of AbstractSTestS3AHugeFiles for
> integration-testing a different buffering option, plus whatever unit tests
> the encryption itself needs.
> {quote}
> Me:
> {quote}
> I get it. But ... there are a couple of subtleties here. One is that the
> tokens expire, while the data is still data. (This might or might not matter,
> depending on the threat...) Another is that customer policies in this area
> do not always align well with common sense. There are blanket policies, like
> "data shall never be written to disk unencrypted", which we have come up
> against, and which we'd like to be able to honestly answer in the
> affirmative.
> We have encrypted MR shuffle as one historical example, and encrypted impala
> memory spills as another.
> {quote}
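To make the proposal above concrete: the buffer choice Steve refers to is the existing {{fs.s3a.fast.upload.buffer}} option (values {{disk}}, {{array}}, {{bytebuffer}}), and a new encrypted-disk variant would sit alongside it. The following is a minimal sketch of the technique only, not S3A code: wrap the temp-file streams in JCE {{CipherOutputStream}}/{{CipherInputStream}} with a per-file AES key that lives only in memory. The class and file names here are illustrative.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;
import java.util.Arrays;

/**
 * Illustrative sketch (not S3A code): buffer upload data to disk through
 * AES/CTR encryption, with a per-file key and IV held only in memory, so the
 * plaintext never reaches local disk.
 */
public class EncryptedDiskBuffer {
    public static void main(String[] args) throws Exception {
        // Per-file key and IV; generated fresh and never written to disk.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        Path tmp = Files.createTempFile("s3a-buffer", ".bin");
        byte[] data = "arbitrary user data destined for S3".getBytes("UTF-8");

        // Encrypt while buffering to disk.
        Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        try (OutputStream out =
                 new CipherOutputStream(Files.newOutputStream(tmp), enc)) {
            out.write(data);
        }

        // The bytes on disk must differ from the plaintext.
        byte[] onDisk = Files.readAllBytes(tmp);
        if (Arrays.equals(onDisk, data)) {
            throw new AssertionError("plaintext leaked to disk");
        }

        // Decrypt when handing the buffered block to the uploader.
        Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (InputStream in =
                 new CipherInputStream(Files.newInputStream(tmp), dec)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
        }
        if (!Arrays.equals(bos.toByteArray(), data)) {
            throw new AssertionError("round-trip mismatch");
        }

        Files.delete(tmp);
        System.out.println("round-trip ok");
    }
}
```

One reason CTR mode is a plausible choice here: it is a stream cipher mode, so a decrypting stream can be reopened and re-read from the start, which matters given Steve's point that the SDK recovers from upload failures by re-reading the file rather than via mark and reset.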
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]