I would strongly recommend against "invent your own mode", in favour of standardized schemes/modes (e.g. XTS).
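As a point of reference, the stock JCE providers already expose standardized authenticated modes (XTS itself is not in the default providers, but AES/GCM is). A minimal sketch of what "use a standardized scheme" looks like in practice — class and helper names here are mine, purely for illustration — with a fresh random IV generated per message:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// Illustrative helper (not from any patch): AES/GCM via the standard JCE
// provider, with a fresh 96-bit IV per encryption as GCM requires.
public class StandardAesExample {

    public static SecretKey newKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        return kg.generateKey();
    }

    public static byte[] encrypt(SecretKey key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv)); // 128-bit auth tag
        return c.doFinal(plaintext);
    }

    public static byte[] decrypt(SecretKey key, byte[] iv, byte[] ciphertext) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return c.doFinal(ciphertext);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = newKey();
        byte[] iv = new byte[12];            // 96-bit nonce, fresh per message
        new SecureRandom().nextBytes(iv);
        byte[] ct = encrypt(key, iv, "sensitive field".getBytes("UTF-8"));
        System.out.println(new String(decrypt(key, iv, ct), "UTF-8")); // prints "sensitive field"
    }
}
```

Note that GCM authenticates as well as encrypts, so IV reuse here is an outright break rather than a mild leak — which is exactly why these modes spell out their IV requirements instead of leaving them to the application.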
Separate from that, I don't understand the reasoning for doing it at the codec
level. It seems quite a bit more messy and complicated than the alternatives,
such as the block device level (e.g. dm-crypt) or the filesystem level (e.g.
ext4 filesystem encryption), which have the advantage that the filesystem
cache actually works.

On Wed, Jan 6, 2016 at 4:19 AM, Renaud Delbru <[email protected]> wrote:
> Dear all,
>
> We would like to contribute a codec that enables the encryption of sensitive
> data in the index, developed as part of an engagement with a customer. We
> think that this could be of interest to the community. If that is the case,
> I'll open a JIRA ticket and upload a first version of the patch. We are also
> looking for feedback on the approach.
>
> Below is a description of the project.
>
> = Introduction
>
> In comparison with approaches where all data is encrypted (e.g., file system
> encryption, index output / directory encryption), encryption at the codec
> level enables more fine-grained control over which blocks of data are
> encrypted. This is more efficient since less data has to be encrypted. It
> also gives more flexibility, such as the ability to select which fields to
> encrypt.
>
> Some of the requirements for this project were:
>
> - The performance impact of the encryption should be reasonable.
> - The user can choose which fields to encrypt.
> - Key management: during the life cycle of the index, the user can provide a
>   new version of his encryption key. Multiple key versions should co-exist
>   in one index.
>
> = What is supported?
>
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this
>   will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest
>   key version available.
>
> = How is it implemented?
>
> == Key Management
>
> One index segment is encrypted with a single key version. An index can have
> multiple segments, each one encrypted using a different key version. The key
> version for a segment is stored in the segment info.
>
> The provided codec is abstract, and a subclass is responsible for providing
> an implementation of the cipher factory. The cipher factory is responsible
> for creating a cipher instance based on a given key version.
>
> == Encryption Model
>
> The encryption model is based on AES/CBC with padding. The initialisation
> vector (IV) is reused for performance reasons, but only on a per-format and
> per-segment basis.
>
> While IV reuse is usually considered a bad practice, the CBC mode is
> somewhat resilient to IV reuse. The only "leak" of information that this
> could lead to is being able to know that two encrypted blocks of data start
> with the same prefix. However, it is unlikely that two data blocks in an
> index segment will start with the same data:
>
> - Stored Fields Format: Each encrypted data block is a compressed block
>   (~4kb) of one or more documents. It is unlikely that two compressed blocks
>   start with the same data prefix.
>
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of
>   terms and payloads from one or more documents. It is unlikely that two
>   compressed blocks start with the same data prefix.
>
> - Term Dictionary Index: The term dictionary index is encoded and encrypted
>   in one single data block.
>
> - Term Dictionary Data: Each data block of the term dictionary encodes a set
>   of suffixes. It is unlikely to have two dictionary data blocks sharing the
>   same prefix within the same segment.
>
> - DocValues: A DocValues file will be composed of multiple encrypted data
>   blocks. It is unlikely to have two data blocks sharing the same prefix
>   within the same segment (each one will encode a list of values associated
>   with a field).
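If I follow the key-management description above, the cipher-factory contract amounts to something like the sketch below. All names here are my guesses, not taken from the actual patch, and it models the AES/CBC design as described rather than endorsing it: each segment records the key version it was written with, and the factory maps that version back to a concrete cipher.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the cipher-factory contract described in the
// proposal: key versions co-exist so that old segments remain readable
// after a key rotation.
public class KeyVersionCipherFactory {

    private final Map<Integer, SecretKey> keysByVersion = new HashMap<>();
    private int currentVersion = 0;

    /** Registers a new key version, e.g. after a key rotation. */
    public synchronized int addKey(SecretKey key) {
        keysByVersion.put(++currentVersion, key);
        return currentVersion;
    }

    /** Latest key version, used when writing a new segment. */
    public synchronized int currentKeyVersion() {
        return currentVersion;
    }

    /** Builds a cipher for the key version recorded in a segment's info. */
    public Cipher newCipher(int keyVersion, int opMode, byte[] iv) throws Exception {
        SecretKey key = keysByVersion.get(keyVersion);
        if (key == null) {
            throw new IllegalArgumentException("Unknown key version: " + keyVersion);
        }
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(opMode, key, new IvParameterSpec(iv));
        return cipher;
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        KeyVersionCipherFactory factory = new KeyVersionCipherFactory();
        int v1 = factory.addKey(kg.generateKey());
        factory.addKey(kg.generateKey());        // rotated key; v1 segments stay readable

        byte[] iv = new byte[16];                // one AES block; per-segment in the model above
        new SecureRandom().nextBytes(iv);
        byte[] ct = factory.newCipher(v1, Cipher.ENCRYPT_MODE, iv)
                           .doFinal("doc block".getBytes("UTF-8"));
        byte[] pt = factory.newCipher(v1, Cipher.DECRYPT_MODE, iv).doFinal(ct);
        System.out.println(new String(pt, "UTF-8")); // prints "doc block"
    }
}
```

The index-upgrader command mentioned earlier then presumably just rewrites each segment with `currentKeyVersion()` and drops the old key once no segment references it.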
>
> To the best of our knowledge, this model should be safe. However, it would
> be good if someone with security expertise in the community could review and
> validate it.
>
> = Performance
>
> We report here a performance benchmark we did on an early prototype based on
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all
> the fields (id, title, body, date) were encrypted. Only the block tree terms
> and compressed stored fields formats were tested at that time.
>
> == Indexing
>
> The indexing throughput slightly decreased, and is roughly 15% less than
> with the base Lucene.
>
> The merge time increased by 35%.
>
> There was no significant difference in terms of index size.
>
> == Query Throughput
>
> With respect to query throughput, we observed no significant impact on the
> following queries: term query, boolean query, phrase query, numeric range
> query.
>
> We observed the following performance impact for queries that need to scan
> a larger portion of the term dictionary:
>
> - prefix query: decrease of ~25%
> - wildcard query (e.g., "fu*r"): decrease of ~60%
> - fuzzy query (distance 1): decrease of ~40%
> - fuzzy query (distance 2): decrease of ~80%
>
> We can see that the decrease in performance is relative to the size of the
> dictionary scan.
>
> == Document Retrieval
>
> We observed a decrease in performance that is relative to the size of the
> set of documents to be retrieved:
>
> - ~20% when retrieving a medium set of documents (100)
> - ~30-40% when retrieving a large set of documents (1000)
>
> = Known Limitations
>
> - Compressed stored fields do not keep the order of fields, since
>   non-encrypted and encrypted fields are stored in separate blocks.
>
> - The current implementation of the cipher factory does not enforce the use
>   of AES/CBC. We are planning to add this to the final version of the patch.
>
> - The current implementation does not change the IV per segment.
We are
> planning to add this to the final version of the patch.
>
> - The current implementation of compressed stored fields decrypts a full
>   compressed block even if only a small portion of it is decompressed (high
>   impact when storing very small documents). We are planning to add this
>   optimisation to the final version of the patch. The overall document
>   retrieval performance might improve with this optimisation.
>
> The codec has been implemented as a contrib. Given that most of the classes
> were final, we had to copy most of the original code from the extended
> formats. At a later stage, we could think of opening some of these classes
> so that they can be extended properly, in order to reduce code duplication
> and simplify code maintenance.
>
> --
> Renaud Delbru
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> ---------------------------------------------------------------------
