[ https://issues.apache.org/jira/browse/HDFS-5143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758596#comment-13758596 ]
Yi Liu commented on HDFS-5143:
------------------------------

Steve, thanks for your comments.

>>> Is there going to be a difference between the listable length of a file
>>> (FileSystem.listStatus()) and the user-code visible length of a file?

In our design the user will see no difference between these two; both are the length of the original file. As you know, for most encryption modes of the various encryption algorithms, the length of the cipher text differs from the length of the original plain text. In our design, however, the cipher text has the same length as the plain text and, more importantly, the bytes have a 1:1 correspondence. To make the encryption more secure we use a different IV (Initialization Vector) for each file, and the IV has a fixed size of 16 bytes. We store the IV in the header of the encrypted file, so:

    Length of encrypted file = Length of original file + 16 bytes

We will implement listStatus/getFileStatus and the other related FileSystem interfaces in CFS to ensure that the length returned is always the original length of the file (a sketch of this adjustment follows below). The key points are that the length of the encrypted file equals the length of the plain-text file plus 16 bytes, that the bytes have a 1:1 correspondence, and that our design allows random access during decryption (also sketched below). So we can easily derive the length of the plain-text file and easily handle the other file system operations.

Actually, if we put the "encryption" flag and the IV in the namenode, then the length of the encrypted file would equal the length of the plain-text file exactly. That would be great for HDFS, but many people may not like the idea of modifying the namenode inodes and code. Furthermore, CFS can decorate file systems other than HDFS, so we propose not to modify the namenode structure.

>>> Is it that the cfs:// view is consistent across all file stat operations,
>>> seek() etc.?

Right, it's consistent. These operations all refer to the plain-text file, since upper-layer applications should be unaware of the encryption, which is transparent. For "du", "df" and other related file system commands: since the length of the encrypted file is the length of the original file plus 16 bytes, "du" will count the plain-text file size, which is consistent with the file size listed by "ls", whereas "df", for example, will count the encrypted file size.

>>> I'm curious about how this interacts with quotas.

This is a good question. HDFS quotas include name quotas and space quotas; only space quotas need discussion here. As described above, the length of an encrypted file equals the length of the plain-text file plus 16 bytes, so an encrypted directory requires slightly more space than an unencrypted one, but I don't think this affects usage. When copying files from an unencrypted directory to an encrypted one, if the space quota is insufficient because of the per-file overhead, we will prompt with a message like "The target directory contains encrypted files; since 16 additional bytes are required per encrypted file, the space quota for the target directory is insufficient."

>>> Are all operations that are atomic today, e.g. renaming one directory under
>>> another, going to remain atomic?

It depends. If one directory is renamed under another and both the source and target are unencrypted directories, the operation remains atomic. However, we do not intend to allow renaming an unencrypted directory to an encrypted one; instead, the user should create the encrypted directory first and then copy the files into it.
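To make the listStatus/getFileStatus point concrete, here is a minimal sketch of the length adjustment, assuming a FilterFileSystem-based CFS that stores a 16-byte IV header at the front of every encrypted file. The class name and the isEncrypted() check are illustrative placeholders, not the actual CFS code from the attached design doc; a real implementation would consult the configured encryption policy.

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FilterFileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch only (hypothetical class, not the CFS patch): a FilterFileSystem
 * that reports the plain-text length by subtracting the 16-byte IV header
 * stored at the front of each encrypted file.
 */
public class CryptoLengthFilterFileSystem extends FilterFileSystem {

  /** Size of the IV header prepended to every encrypted file. */
  private static final long IV_HEADER_LENGTH = 16L;

  public CryptoLengthFilterFileSystem(FileSystem fs) {
    super(fs);
  }

  /** Placeholder; the real CFS would check its encryption policy. */
  private boolean isEncrypted(FileStatus stat) {
    return !stat.isDirectory();
  }

  /** Rewrites the status so the length refers to the plain text. */
  private FileStatus toPlainTextStatus(FileStatus stat) {
    if (!isEncrypted(stat)) {
      return stat;
    }
    long plainLen = Math.max(0L, stat.getLen() - IV_HEADER_LENGTH);
    return new FileStatus(plainLen, stat.isDirectory(),
        stat.getReplication(), stat.getBlockSize(),
        stat.getModificationTime(), stat.getAccessTime(),
        stat.getPermission(), stat.getOwner(), stat.getGroup(),
        stat.getPath());
  }

  @Override
  public FileStatus getFileStatus(Path f) throws IOException {
    return toPlainTextStatus(fs.getFileStatus(f));
  }

  @Override
  public FileStatus[] listStatus(Path f) throws IOException {
    FileStatus[] stats = fs.listStatus(f);
    for (int i = 0; i < stats.length; i++) {
      stats[i] = toPlainTextStatus(stats[i]);
    }
    return stats;
  }
}
{code}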
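On the random-access property: the comment above does not name the cipher mode (see the attached design doc), but AES-CTR is a common way to get both the 1:1 byte correspondence and seekable decryption, because the keystream block for plain-text offset pos depends only on the per-file IV and pos / 16. The sketch below is illustrative only, with hypothetical class and method names: it advances the 16-byte counter by pos / 16 blocks and discards pos % 16 bytes of the first keystream block.

{code:java}
import java.math.BigInteger;
import java.util.Arrays;

import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Sketch of CTR-mode random access; not the CFS API. */
public class CtrSeekDemo {

  private static final int AES_BLOCK_SIZE = 16;

  /** Returns the IV advanced by the given number of 16-byte blocks. */
  static byte[] advanceCounter(byte[] iv, long blocks) {
    BigInteger counter =
        new BigInteger(1, iv).add(BigInteger.valueOf(blocks));
    byte[] raw = counter.toByteArray();
    byte[] out = new byte[AES_BLOCK_SIZE];
    // Right-align into a 16-byte counter block; truncating the high
    // bytes on overflow implements the mod-2^128 counter wraparound.
    int copy = Math.min(raw.length, AES_BLOCK_SIZE);
    System.arraycopy(raw, raw.length - copy, out,
        AES_BLOCK_SIZE - copy, copy);
    return out;
  }

  /** Decrypts ciphertext that begins at plain-text offset {@code pos}. */
  static byte[] decryptAt(byte[] key, byte[] fileIv, long pos,
      byte[] cipherText) throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    // Start the counter at the block containing offset pos.
    byte[] iv = advanceCounter(fileIv, pos / AES_BLOCK_SIZE);
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
        new IvParameterSpec(iv));
    int skip = (int) (pos % AES_BLOCK_SIZE);
    // Prepend dummy bytes so the ciphertext lines up with the keystream
    // within the first block, then drop the garbage prefix afterwards.
    byte[] padded = new byte[skip + cipherText.length];
    System.arraycopy(cipherText, 0, padded, skip, cipherText.length);
    byte[] plain = cipher.doFinal(padded);
    return Arrays.copyOfRange(plain, skip, plain.length);
  }
}
{code}

With this property, implementing Seek and PositionedReadable on the CFS input stream reduces to re-initializing the cipher at the block containing the target offset and reading from there.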
> Hadoop cryptographic file system
> --------------------------------
>
>                 Key: HDFS-5143
>                 URL: https://issues.apache.org/jira/browse/HDFS-5143
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: security
>    Affects Versions: 3.0.0
>            Reporter: Yi Liu
>              Labels: rhino
>             Fix For: 3.0.0
>
>         Attachments: HADOOP cryptographic file system.pdf
>
>
> There is an increasing need for securing data when Hadoop customers use various upper-layer applications, such as Map-Reduce, Hive, Pig, HBase and so on.
> HADOOP CFS (HADOOP Cryptographic File System) is used to secure data, based on the HADOOP "FilterFileSystem" decorating DFS or other file systems, and it is transparent to upper-layer applications. It is configurable, scalable and fast.
> High-level requirements:
> 1. Transparent to, and requiring no modification of, upper-layer applications.
> 2. "Seek" and "PositionedReadable" are supported on the CFS input stream if the wrapped file system supports them.
> 3. Very high performance for encryption and decryption, so that they do not become a bottleneck.
> 4. Can decorate HDFS and all other Hadoop file systems without modifying the existing structure of the file system, such as the namenode and datanode structure when the wrapped file system is HDFS.
> 5. Admins can configure encryption policies, such as which directories will be encrypted.
> 6. A robust key management framework.
> 7. Support for pread and append operations if the wrapped file system supports them.