Be sure to add that comment about multi-tenancy to the Jira description since that is a key aspect of this particular approach.
-- Jack Krupansky

On Thu, Jan 7, 2016 at 4:52 AM, Renaud Delbru <renaud@siren.solutions> wrote:

> Hi Robert,
>
> Yes, you are right. This approach is more complex than plain fs level
> encryption, but it enables more fine-grained control over what is
> encrypted. For example, with fs level encryption it would not be possible
> to choose which fields to encrypt. Also, with fs level encryption, all the
> data is encrypted regardless of whether it is sensitive or not. For
> example, in such a scenario, the full posting lists will be encrypted,
> which is unnecessary, and you'll pay the cost of encrypting the posting
> lists.
> It is true that if the filesystem caches unencrypted pages, then with a
> warm cache you will likely get better performance. However, this also means
> that most of the index data will reside in memory in an unencrypted form.
> If the server is compromised, then this will make life easier for the
> attacker. You also have the (small) issue with swap, which can end up
> holding a large portion of the index unencrypted. This can be solved by
> using an encrypted swap, but this means that the data is now encrypted
> using a single key and not a per-user key. Also, this adds complexity in
> the management of the system.
> Highly sensitive installations can make the trade-off between performance
> and security. There are some applications for Solr that are not served by
> the other approaches.
>
> This codec was developed in the context of a large multi-tenant
> architecture, where each user has their own index / collection. Each user
> has their own key, and can update that key at any time.
> While it seems it would be possible with ext4 to handle a per-user key
> (e.g., one key per directory), it makes the key and index management more
> complex (especially in SolrCloud). This is not adequate for some
> environments.
> Also, it does not allow the management of multiple key versions in one
> index.
> If the user changes their key, we have to re-encrypt the full directory,
> which is not acceptable wrt performance for some environments.
>
> The codec level encryption approach is more adequate for some environments
> than the fs level encryption approach. Also, it is to be noted that this
> codec does not affect the rest of Lucene/Solr. Users will be able to choose
> which approach is more adequate for their environment. This gives more
> options to Lucene/Solr users.
>
> P.S.: I have created the issue LUCENE-6966 and moved the discussion there,
> as it is simpler for external people to participate in the discussion.
>
> Regards
> --
> Renaud Delbru
>
>
> On 06/01/16 15:32, Robert Muir wrote:
>
>> I would strongly recommend against "invent your own mode", and instead
>> using standardized schemes/modes (e.g. XTS).
>>
>> Separate from that, I don't understand the reasoning to do it at the
>> codec level. It seems quite a bit more messy and complicated than the
>> alternatives, such as block device level (e.g. dm-crypt), or
>> filesystem level (e.g. ext4 filesystem encryption), which have the
>> advantage of the filesystem cache actually working.
>>
>>
>> On Wed, Jan 6, 2016 at 4:19 AM, Renaud Delbru <renaud@siren.solutions>
>> wrote:
>>
>>> Dear all,
>>>
>>> We would like to contribute a codec that enables the encryption of
>>> sensitive data in the index. It has been developed as part of an
>>> engagement with a customer. We think that this could be of interest for
>>> the community. If that is the case, I’ll open a JIRA ticket and upload a
>>> first version of the patch. We are also looking for feedback on the
>>> approach.
>>>
>>> Below is a description of the project.
>>>
>>> = Introduction
>>>
>>> In comparison with approaches where all data is encrypted (e.g., file
>>> system encryption, index output / directory encryption), encryption at a
>>> codec level enables more fine-grained control on which block of data is
>>> encrypted.
>>> This is more efficient since less data has to be encrypted. This also
>>> gives more flexibility, such as the ability to select which fields to
>>> encrypt.
>>>
>>> Some of the requirements for this project were:
>>>
>>> - The performance impact of the encryption should be reasonable.
>>> - The user can choose which fields to encrypt.
>>> - Key management: During the life cycle of the index, the user can
>>>   provide a new version of their encryption key. Multiple key versions
>>>   should co-exist in one index.
>>>
>>> = What is supported?
>>>
>>> - Block tree terms index and dictionary
>>> - Compressed stored fields format
>>> - Compressed term vectors format
>>> - Doc values format (prototype based on an encrypted index output) - this
>>>   will be submitted as a separate patch
>>> - Index upgrader: command to upgrade all the index segments with the
>>>   latest key version available.
>>>
>>> = How is it implemented?
>>>
>>> == Key Management
>>>
>>> One index segment is encrypted with a single key version. An index can
>>> have multiple segments, each one encrypted using a different key
>>> version. The key version for a segment is stored in the segment info.
>>>
>>> The provided codec is abstract, and a subclass is responsible for
>>> providing an implementation of the cipher factory. The cipher factory is
>>> responsible for creating a cipher instance based on a given key version.
>>>
>>> == Encryption Model
>>>
>>> The encryption model is based on AES/CBC with padding. The
>>> initialisation vector (IV) is reused for performance reasons, but only
>>> on a per format and per segment basis.
>>>
>>> While IV reuse is usually considered a bad practice, the CBC mode is
>>> somewhat resilient to IV reuse. The only "leak" of information that this
>>> could lead to is being able to know that two encrypted blocks of data
>>> start with the same prefix.
>>> However, it is unlikely that two data blocks in an index segment will
>>> start with the same data:
>>>
>>> - Stored Fields Format: Each encrypted data block is a compressed block
>>>   (~4 KB) of one or more documents. It is unlikely that two compressed
>>>   blocks start with the same data prefix.
>>>
>>> - Term Vectors: Each encrypted data block is a compressed block (~4 KB)
>>>   of terms and payloads from one or more documents. It is unlikely that
>>>   two compressed blocks start with the same data prefix.
>>>
>>> - Term Dictionary Index: The term dictionary index is encoded and
>>>   encrypted in one single data block.
>>>
>>> - Term Dictionary Data: Each data block of the term dictionary encodes a
>>>   set of suffixes. It is unlikely to have two dictionary data blocks
>>>   sharing the same prefix within the same segment.
>>>
>>> - DocValues: A DocValues file will be composed of multiple encrypted data
>>>   blocks. It is unlikely to have two data blocks sharing the same prefix
>>>   within the same segment (each one will encode a list of values
>>>   associated with a field).
>>>
>>> To the best of our knowledge, this model should be safe. However, it
>>> would be good if someone with security expertise in the community could
>>> review and validate it.
>>>
>>> = Performance
>>>
>>> We report here a performance benchmark we did on an early prototype
>>> based on Lucene 4.x. The benchmark was performed on the Wikipedia dataset
>>> where all the fields (id, title, body, date) were encrypted. Only the
>>> block tree terms and compressed stored fields formats were tested at
>>> that time.
>>>
>>> == Indexing
>>>
>>> The indexing throughput decreased by roughly 15% compared with base
>>> Lucene.
>>>
>>> The merge time increased by 35%.
>>>
>>> There was no significant difference in terms of index size.
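As an aside on the Encryption Model section quoted above: the prefix leak that comes with IV reuse in CBC can be reproduced with the standard JCA `Cipher` API. The following is only a minimal sketch, not code from the patch. The class name `CbcIvReuseDemo`, the all-zero key and the fixed IVs are illustrative assumptions. It encrypts two plaintexts that share their first 16 bytes and counts how many leading ciphertext blocks are identical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo class; not part of Lucene or of the proposed codec.
final class CbcIvReuseDemo {
    static final int BLOCK = 16; // AES block size in bytes

    // Encrypts data with AES/CBC/PKCS5Padding under the given key and IV.
    static byte[] encrypt(byte[] key, byte[] iv, byte[] data) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                        new IvParameterSpec(iv));
            return cipher.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts how many leading 16-byte blocks of a and b are identical.
    static int sharedLeadingBlocks(byte[] a, byte[] b) {
        int blocks = Math.min(a.length, b.length) / BLOCK;
        int n = 0;
        while (n < blocks && Arrays.equals(
                Arrays.copyOfRange(a, n * BLOCK, (n + 1) * BLOCK),
                Arrays.copyOfRange(b, n * BLOCK, (n + 1) * BLOCK))) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] key = new byte[16]; // demo key (all zeros); an assumption for illustration
        byte[] ivA = new byte[16]; // demo IV (all zeros)
        byte[] ivB = new byte[16];
        ivB[0] = 1;                // a second, different IV

        // Two plaintexts with the same first 16 bytes and a different tail.
        byte[] p1 = "AAAAAAAAAAAAAAAAfirst block tail".getBytes(StandardCharsets.US_ASCII);
        byte[] p2 = "AAAAAAAAAAAAAAAAother block tail".getBytes(StandardCharsets.US_ASCII);

        // Same key, same IV: the shared first block is visible in the ciphertext.
        System.out.println(sharedLeadingBlocks(encrypt(key, ivA, p1), encrypt(key, ivA, p2))); // prints 1
        // Same key, different IVs: no leading block matches.
        System.out.println(sharedLeadingBlocks(encrypt(key, ivA, p1), encrypt(key, ivB, p2))); // prints 0
    }
}
```

With a reused IV, an observer learns exactly how many leading 16-byte blocks two encrypted data blocks have in common, which is the leak described in the proposal. Whether that is acceptable depends on how often blocks in a segment really share a prefix.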
>>>
>>> == Query Throughput
>>>
>>> With respect to query throughput, we observed no significant impact on
>>> the following queries: Term query, boolean query, phrase query, numeric
>>> range query.
>>>
>>> We observed the following performance impact for queries that need to
>>> scan a larger portion of the term dictionary:
>>>
>>> - prefix query: decrease of ~25%
>>> - wildcard query (e.g., “fu*r”): decrease of ~60%
>>> - fuzzy query (distance 1): decrease of ~40%
>>> - fuzzy query (distance 2): decrease of ~80%
>>>
>>> We can see that the decrease in performance grows with the size of the
>>> dictionary scan.
>>>
>>> == Document Retrieval
>>>
>>> We observed a decrease in performance that grows with the size of the
>>> set of documents to be retrieved:
>>>
>>> - ~20% when retrieving a medium set of documents (100)
>>> - ~30-40% when retrieving a large set of documents (1000)
>>>
>>> = Known Limitations
>>>
>>> - compressed stored fields do not keep the order of fields since
>>>   non-encrypted and encrypted fields are stored in separate blocks.
>>>
>>> - the current implementation of the cipher factory does not enforce the
>>>   use of AES/CBC. We are planning to add this to the final version of
>>>   the patch.
>>>
>>> - the current implementation does not change the IV per segment. We are
>>>   planning to add this to the final version of the patch.
>>>
>>> - the current implementation of compressed stored fields decrypts a full
>>>   compressed block even if only a small portion is decompressed (high
>>>   impact when storing very small documents). We are planning to add this
>>>   optimisation to the final version of the patch. The overall document
>>>   retrieval performance might increase with this optimisation.
>>>
>>> The codec has been implemented as a contrib. Given that most of the
>>> classes were final, we had to copy most of the original code from the
>>> extended formats.
>>> At a later stage, we could think of opening some of these classes to
>>> extend them properly in order to reduce code duplication and simplify
>>> code maintenance.
>>>
>>> --
>>> Renaud Delbru
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
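To make the key management part of the proposal easier to discuss, here is a rough sketch of what a cipher factory keyed by version could look like. It assumes AES/CBC with PKCS5 padding via the standard JCA API. The class `VersionedCipherFactory` and all of its methods are hypothetical names chosen for illustration and are not the API of the LUCENE-6966 patch. In the real codec the key version would be read from the segment info rather than passed around by hand.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.TreeMap;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch; names are assumptions, not the patch's API.
final class VersionedCipherFactory {
    // key version -> raw AES key bytes; TreeMap keeps versions sorted
    private final TreeMap<Integer, byte[]> keys = new TreeMap<>();

    // Registers a new key version (e.g. when the user rotates the key).
    void addKey(int version, byte[] key) {
        keys.put(version, key.clone());
    }

    // Version that a newly written segment would record in its segment info.
    int latestVersion() {
        return keys.lastKey();
    }

    // Creates an initialised AES/CBC cipher for the given key version.
    Cipher newCipher(int mode, int version, byte[] iv) {
        byte[] key = keys.get(version);
        if (key == null) {
            throw new IllegalArgumentException("Unknown key version: " + version);
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypts or decrypts one data block with the key of the given version.
    byte[] apply(int mode, int version, byte[] iv, byte[] data) {
        try {
            return newCipher(mode, version, iv).doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        VersionedCipherFactory factory = new VersionedCipherFactory();
        byte[] iv = new byte[16]; // demo IV (all zeros); illustration only
        byte[] block = "stored fields block".getBytes(StandardCharsets.UTF_8);

        factory.addKey(1, new byte[16]);        // demo key, version 1
        int versionA = factory.latestVersion(); // "segment A" records version 1
        byte[] segmentA = factory.apply(Cipher.ENCRYPT_MODE, versionA, iv, block);

        byte[] key2 = new byte[16];
        key2[0] = 7;
        factory.addKey(2, key2);                // user rotates the key
        int versionB = factory.latestVersion(); // "segment B" records version 2
        byte[] segmentB = factory.apply(Cipher.ENCRYPT_MODE, versionB, iv, block);

        // Both segments stay readable because each one remembers its key version.
        System.out.println(new String(
            factory.apply(Cipher.DECRYPT_MODE, versionA, iv, segmentA), StandardCharsets.UTF_8));
        System.out.println(new String(
            factory.apply(Cipher.DECRYPT_MODE, versionB, iv, segmentB), StandardCharsets.UTF_8));
    }
}
```

In this sketch a key change costs nothing up front: old segments keep their recorded version, and an upgrader in the spirit of the "index upgrader" from the proposal could rewrite them with `latestVersion()` later.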