[ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477634#comment-13477634 ]

Jerry Chen commented on MAPREDUCE-4491:
---------------------------------------

Hi Benoy,
I am Haifeng from Intel, and we have been discussing this feature offline. I 
really appreciate your initiative on this work. We also see the importance of 
encryption and decryption in Hadoop when we are dealing with sensitive data. 

Just as you pointed out, the functional requirements are more or less the 
same. For the Hadoop community, we wish to have a high-level abstraction that 
provides a foundation for these requirements across different Hadoop 
components (such as HDFS, MapReduce, HBase) while enabling different 
implementations, such as different encryption algorithms or different ways of 
key management from different parties / companies, so that the concept is not 
bound to a specific implementation. Just as we discussed offline, the driving 
forces for such an abstraction are summarized as follows:

1. Encryption and decryption need to be supported in different components and 
usage models. For example, we may use the HDFS client API and a codec directly 
to encrypt and decrypt an HDFS file; we may use MapReduce to process an 
encrypted file and output an encrypted file; and HBase may need to store its 
files (such as HFiles) in an encrypted way.

2. The community may have different implementations of encryption codecs and 
different ways of providing keys. CompressionCodec provides a foundation for 
related work, but it is not enough for encryption and decryption because 
CompressionCodec is assumed to initialize from the Hadoop Configuration, while 
encryption/decryption may need a per-file crypto context such as the key. With 
an abstraction layer for crypto, we can share common features such as "provide 
different keys for different input files of a MapReduce job" rather than each 
implementation finding its own way into the MapReduce core and eventually 
turning into a mess.
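
To make the "per-file crypto context" point concrete, a minimal context could 
look roughly like the sketch below. The field names are illustrative only; the 
attached java files define the actual proposed structure.

    import java.security.Key;

    // Illustrative only: a minimal per-file crypto context carrying the key
    // needed to encrypt or decrypt one particular file.
    public class CryptoContext {
      private Key key;

      public Key getKey() {
        return key;
      }

      public void setKey(Key key) {
        this.key = key;
      }
    }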

Based on these driving forces, the work you have done, and our offline 
discussions, we refined our work and would like to propose the following:

1. For Hadoop Common, a new CryptoCodec interface which extends 
CompressionCodec and adds getCryptoContext/setCryptoContext methods. Just like 
CompressionCodec, it will initialize its global settings from Configuration, 
but CryptoCodec will receive its crypto context (the key, for example) through 
a CryptoContext object set by setCryptoContext. This allows different usage 
cases, such as "use CryptoCodec directly to encrypt/decrypt an HDFS file by 
providing the CryptoContext (key) directly" or "the MapReduce way of using 
CryptoCodec, where a CryptoContext (key) is chosen per file based on some 
policy".
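
As a rough sketch (the attached java files contain the actual definitions; 
this is only to illustrate the shape of the interface):

    import org.apache.hadoop.io.compress.CompressionCodec;

    // Sketch of the proposed interface: global settings still come from the
    // Hadoop Configuration via the CompressionCodec contract, while per-file
    // state such as the key is supplied through the CryptoContext.
    public interface CryptoCodec extends CompressionCodec {

      // Returns the crypto context (e.g. the key) currently in use.
      CryptoContext getCryptoContext();

      // Sets the per-file crypto context before the encryption/decryption
      // streams are created.
      void setCryptoContext(CryptoContext cryptoContext);
    }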

Any specific crypto implementation falls under this umbrella and will 
implement CryptoCodec. The PGPCodec fits well as an implementation of 
CryptoCodec, and we are also able to implement our splittable CryptoCodec.

2. For MapReduce, use a CryptoContextProvider interface to abstract the 
implementation-specific service, so that the MapReduce core can contain shared 
code that retrieves the CryptoContext of a specific file from a 
CryptoContextProvider and passes it to the CryptoCodec in use. Different 
CryptoContextProvider implementations can implement different ways of deciding 
the CryptoContext and different ways of retrieving keys from different key 
stores. We can provide basic and common implementations of 
CryptoContextProvider, such as "a CryptoContextProvider that provides the 
CryptoContext for a file by matching the file path against a regular 
expression and getting the key from a Java KeyStore", while not preventing 
users from implementing or extending their own if an existing implementation 
doesn't satisfy their requirements.
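
A rough sketch of such a provider, with illustrative method names (again, the 
attached java files define the actual interface):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    // Sketch of the provider abstraction. The MapReduce core would ask the
    // provider for the CryptoContext of each input/output file and hand it
    // to the CryptoCodec in use.
    public interface CryptoContextProvider {

      // Initialize from job configuration (and, if needed, job credentials).
      void init(Configuration conf);

      // Decide and return the CryptoContext (key, etc.) for the given file,
      // e.g. by matching the path against configured rules and looking the
      // key up in a Java KeyStore.
      CryptoContext getCryptoContext(Path file);
    }

    // Shared MapReduce-side code could then be roughly:
    //   CryptoContext context = provider.getCryptoContext(inputFile);
    //   cryptoCodec.setCryptoContext(context);
    //   InputStream in = cryptoCodec.createInputStream(rawIn);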

CryptoContextProvider configurations are passed through the Hadoop JobConf and 
the job credentials (credential secret keys), and the CryptoContextProvider 
implementation can choose whether or not to encrypt the secret keys stored in 
the job Credentials.
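
For example, a job could be wired up roughly as follows; the property names 
and the provider class used here are placeholders, not part of the proposal or 
of any patch:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class CryptoJobSetupExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Hypothetical properties: which provider to use and how it maps
        // file paths to key aliases.
        conf.set("mapreduce.crypto.context.provider.class",
            "org.example.KeyStoreCryptoContextProvider");
        conf.set("mapreduce.crypto.context.provider.rules",
            ".*\\.enc$=payrollKey");   // regex on file path -> key alias

        // Secret material goes into the job Credentials rather than the
        // plain configuration; the provider may choose to encrypt it.
        conf.getCredentials().addSecretKey(
            new Text("crypto.keystore.password"), "changeit".getBytes());
      }
    }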

I attached the java files of these interfaces and basic structures in the 
Attachments section to demonstrate the concepts, and I would like to write a 
design document for these high-level things once we have had enough discussion 
and come to an agreement.

Again, thanks for your patience and time. 

                
> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted 
> wherever it is stored. Common use case is to pull encrypted data out of a 
> datasource and store in HDFS for analysis. The keys are stored in an external 
> keystore. 
> The feature adds a customizable framework to integrate different types of 
> keystores, support for Java KeyStore, read keys from keystores, and transport 
> keys from JobClient to Tasks.
> The feature adds PGP encryption as a codec and additional utilities to 
> perform encryption related steps.
> The design document is attached. It explains the requirement, design and use 
> cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as an initial 
> work for further refinement.
> Update: The patches are uploaded to subtasks. 
