[ https://issues.apache.org/jira/browse/MAPREDUCE-4491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477634#comment-13477634 ]
Jerry Chen commented on MAPREDUCE-4491:
---------------------------------------

Hi Benoy, I am Haifeng from Intel. We have been discussing this feature offline, and I really appreciate your initiating this work. We also see the importance of encryption and decryption in Hadoop when dealing with sensitive data.

Just as you pointed out, the functional requirements are more or less the same. For the Hadoop community, we wish to have a high-level abstraction that provides a foundation for these requirements across the different Hadoop components (such as HDFS, MapReduce, and HBase), while enabling different implementations, such as different encryption algorithms or different key-management approaches from different parties/companies, so that the concept is not bound to one specific implementation.

As we discussed offline, the driving forces for such an abstraction can be summarized as follows:

1. Encryption and decryption need to be supported in different components and usage models. For example, we may use the HDFS client API and a codec directly to encrypt and decrypt an HDFS file; we may use MapReduce to process an encrypted file and output an encrypted file; and HBase may need to store its files (such as HFiles) in encrypted form.

2. The community may have different implementations of encryption codecs and different ways of providing keys. CompressionCodec gives us a foundation for related work, but it is not enough for encryption and decryption, because a CompressionCodec is assumed to initialize itself from the Hadoop Configuration, while encryption/decryption may need a per-file crypto context, such as the key. With an abstraction layer for crypto, we can share common features such as "provide different keys for different input files of a MapReduce job", instead of each implementation finding its own way into the MapReduce core and eventually turning it into a mess.

Based on these driving forces, the work you have done, and our offline discussions, we refined our work and would like to propose the following (a rough sketch of the interfaces follows after point 2 below):

1. For Hadoop Common, a new CryptoCodec interface that extends CompressionCodec, adding the methods getCryptoContext/setCryptoContext. Just like CompressionCodec, it initializes its global settings from the Configuration; but CryptoCodec receives its crypto context (the key, for example) through a CryptoContext object set via setCryptoContext. This allows different usage models, such as "use CryptoCodec directly to encrypt/decrypt an HDFS file by directly providing the CryptoContext (key)" or "the MapReduce way of using CryptoCodec, where a CryptoContext (key) is chosen per file based on some policy". Any specific crypto implementation falls under this umbrella and implements CryptoCodec. The PGPCodec fits well as an implementation of CryptoCodec, and we would also be able to implement our splittable CryptoCodec.

2. For MapReduce, a CryptoContextProvider interface to abstract the implementation-specific service, so that the MapReduce core can contain shared code that retrieves the CryptoContext of a specific file from a CryptoContextProvider and passes it to the CryptoCodec in use. Different CryptoContextProvider implementations can implement different policies for deciding the CryptoContext and different ways of retrieving keys from different key stores.
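To make the proposal concrete, here is a rough, simplified sketch of what these pieces could look like. The method names getCryptoContext/setCryptoContext are as proposed above; the CryptoContext fields and the provider's init signature are only illustrative placeholders, not a final design (please refer to the attached Java files for the actual interfaces):

{code:java}
import java.io.IOException;
import java.security.Key;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.security.Credentials;

/** Per-file crypto material; the field set here is illustrative only. */
class CryptoContext {
  private final Key key;
  public CryptoContext(Key key) { this.key = key; }
  public Key getKey() { return key; }
}

/**
 * Extends CompressionCodec so that the existing codec plumbing (streams,
 * codec factories) can be reused. Global settings still come from the
 * Configuration; per-file state arrives through the CryptoContext.
 */
interface CryptoCodec extends CompressionCodec {
  void setCryptoContext(CryptoContext context);
  CryptoContext getCryptoContext();
}

/**
 * Abstracts how the MapReduce core obtains the CryptoContext for a given
 * file. Implementations decide the policy (e.g. path matching) and the key
 * retrieval (e.g. a Java KeyStore). The init signature is a sketch only.
 */
interface CryptoContextProvider {
  void init(Configuration conf, Credentials credentials);
  CryptoContext getCryptoContext(Path file) throws IOException;
}
{code}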
We can provide basic and common implementations of CryptoContextProvider, such as "a CryptoContextProvider that provides the CryptoContext for a file by matching the file path against a regular expression and retrieves the key from a Java KeyStore", while not preventing users from implementing or extending their own if the existing implementations do not satisfy their requirements. CryptoContextProvider configurations are passed through the Hadoop job configuration and credentials (credential secret keys), and the CryptoContextProvider implementation can choose whether or not to encrypt the secret keys stored in the job Credentials.

I attached the Java files of these interfaces and basic structures in the Attachments section to demonstrate the concepts, and I would like to put together a design document for these high-level pieces once we have had enough discussion and come to an agreement. Again, thanks for your patience and time.

> Encryption and Key Protection
> -----------------------------
>
>                 Key: MAPREDUCE-4491
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4491
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: documentation, security, task-controller, tasktracker
>            Reporter: Benoy Antony
>            Assignee: Benoy Antony
>         Attachments: Hadoop_Encryption.pdf, Hadoop_Encryption.pdf
>
>
> When dealing with sensitive data, it is required to keep the data encrypted wherever it is stored. A common use case is to pull encrypted data out of a datasource and store it in HDFS for analysis. The keys are stored in an external keystore.
> The feature adds a customizable framework to integrate different types of keystores, support for the Java KeyStore, reading keys from keystores, and transporting keys from the JobClient to the Tasks.
> The feature adds PGP encryption as a codec and additional utilities to perform encryption-related steps.
> The design document is attached. It explains the requirements, design, and use cases.
> Kindly review and comment. Collaboration is very much welcome.
> I have a tested patch for this for 1.1 and will upload it soon as initial work for further refinement.
> Update: The patches are uploaded to the subtasks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira