Re: Contribution: Codec for index-level encryption

Jack Krupansky Thu, 07 Jan 2016 07:01:44 -0800

Be sure to add that comment about multi-tenancy to the Jira description
since that is a key aspect of this particular approach.


-- Jack Krupansky

On Thu, Jan 7, 2016 at 4:52 AM, Renaud Delbru <renaud@siren.solutions>
wrote:

> Hi Robert,
>
> Yes, you are right. This approach is more complex than plain fs level
> encryption, but it enables more fine-grained control over what is
> encrypted. For example, with fs level encryption it is not possible to
> choose which fields to encrypt. Also, with fs level encryption, all the
> data is encrypted regardless of whether it is sensitive or not. For
> example, in such a scenario, the full posting lists will be encrypted,
> which is unnecessary, and you'll pay the cost of encrypting them.
> It is true that if the filesystem caches unencrypted pages, then with a
> warm cache you will likely get better performance. However, this also means
> that most of the index data will reside in memory in an unencrypted form.
> If the server is compromised, then this will make life easier for the
> attacker. There is also the (small) issue of swap, which can end up
> holding a large portion of the index unencrypted. This can be solved by
> using an encrypted swap, but this means that the data is now encrypted
> using a single key and not a per-user key. Also, this adds complexity in
> the management of the system.
> Highly sensitive installations can make the trade-off between performance
> and security. There are some applications for Solr that are not served by
> the other approaches.
>
> This codec was developed in the context of a large multi-tenant
> architecture, where each user has their own index / collection. Each user
> has their own key, and can update it at any time.
> While it seems it would be possible with ext4 to handle a per-user key
> (e.g., one key per directory), it makes the key and index management more
> complex (especially in SolrCloud). This is not adequate for some
> environments.
> Also, it does not allow the management of multiple key versions in one
> index. If the user changes their key, we have to re-encrypt the full
> directory, which is not acceptable performance-wise for some environments.
>
> The codec level encryption approach is more adequate for some environments
> than the fs level encryption approach. Also, it is to be noted that this
> codec does not affect the rest of Lucene/Solr. Users will be able to choose
> which approach is more adequate for their environment. This gives more
> options to Lucene/Solr users.
>
> P.S.: I have created the issue LUCENE-6966 and moved the discussion there,
> as it is simpler for external people to participate in the discussion.
>
> Regards
> --
> Renaud Delbru
>
>
> On 06/01/16 15:32, Robert Muir wrote:
>
>> I would strongly recommend against "invent your own mode", and instead
>> using standardized schemes/modes (e.g. XTS).
>>
>> Separate from that, I don't understand the reasoning to do it at the
>> codec level. seems quite a bit more messy and complicated than the
>> alternatives, such as block device level (e.g. dm-crypt), or
>> filesystem level (e.g. ext4 filesystem encryption), which have the
>> advantage of the filesystem cache actually working.
>>
>>
>> On Wed, Jan 6, 2016 at 4:19 AM, Renaud Delbru <renaud@siren.solutions>
>> wrote:
>>
>>> Dear all,
>>>
>>> We would like to contribute a codec that enables the encryption of
>>> sensitive data in the index. It has been developed as part of an
>>> engagement with a customer. We think that this could be of interest to
>>> the community. If that is the case, I’ll open a JIRA ticket and upload a
>>> first version of the patch. We are also looking for feedback on the
>>> approach.
>>>
>>> Below is a description of the project.
>>>
>>> = Introduction
>>>
>>> In comparison with approaches where all data is encrypted (e.g., file
>>> system encryption, index output / directory encryption), encryption at
>>> a codec level enables more fine-grained control over which blocks of
>>> data are encrypted. This is more efficient since less data has to be
>>> encrypted. It also gives more flexibility, such as the ability to
>>> select which fields to encrypt.
>>>
>>> Some of the requirements for this project were:
>>>
>>> - The performance impact of the encryption should be reasonable.
>>> - The user can choose which field to encrypt.
>>> - Key management: During the life cycle of the index, the user can
>>> provide a new version of their encryption key. Multiple key versions
>>> should co-exist in one index.
>>>
>>> = What is supported?
>>>
>>> - Block tree terms index and dictionary
>>> - Compressed stored fields format
>>> - Compressed term vectors format
>>> - Doc values format (prototype based on an encrypted index output) - this
>>> will be submitted as a separate patch
>>> - Index upgrader: command to upgrade all the index segments with the
>>> latest key version available.
>>>
>>> = How is it implemented?
>>>
>>> == Key Management
>>>
>>> One index segment is encrypted with a single key version. An index can
>>> have multiple segments, each one encrypted using a different key
>>> version. The key version for a segment is stored in the segment info.
>>>
>>> The provided codec is abstract, and a subclass is responsible for
>>> providing an implementation of the cipher factory. The cipher factory
>>> is responsible for creating a cipher instance based on a given key
>>> version.
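
As an illustration of the key management described above, here is a minimal Java sketch of a cipher factory that maps key versions to cipher instances. The class name `VersionedCipherFactory` and its methods are hypothetical and are not taken from the patch. It assumes raw AES keys held in memory and uses the standard `javax.crypto` API with AES/CBC and PKCS5 padding.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.TreeMap;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: names are illustrative, not the API of the patch.
public class VersionedCipherFactory {
  // Key material indexed by key version; a TreeMap keeps versions sorted.
  private final TreeMap<Integer, byte[]> keysByVersion = new TreeMap<>();

  public void addKey(int version, byte[] aesKey) {
    keysByVersion.put(version, aesKey.clone());
  }

  // Version that newly written segments would be encrypted with.
  public int latestVersion() {
    if (keysByVersion.isEmpty()) {
      throw new IllegalStateException("no key registered");
    }
    return keysByVersion.lastKey();
  }

  // Creates a cipher for the key version recorded for a segment.
  public Cipher newCipher(int mode, int keyVersion, byte[] iv)
      throws GeneralSecurityException {
    byte[] key = keysByVersion.get(keyVersion);
    if (key == null) {
      throw new IllegalArgumentException("unknown key version: " + keyVersion);
    }
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return cipher;
  }

  public static void main(String[] args) throws GeneralSecurityException {
    VersionedCipherFactory factory = new VersionedCipherFactory();
    factory.addKey(1, new byte[16]); // demo keys only, never use in production
    byte[] newerKey = new byte[16];
    Arrays.fill(newerKey, (byte) 7);
    factory.addKey(2, newerKey);

    byte[] iv = new byte[16]; // fixed demo IV
    int segmentKeyVersion = factory.latestVersion();
    byte[] encrypted = factory
        .newCipher(Cipher.ENCRYPT_MODE, segmentKeyVersion, iv)
        .doFinal("stored field".getBytes(StandardCharsets.UTF_8));
    byte[] decrypted = factory
        .newCipher(Cipher.DECRYPT_MODE, segmentKeyVersion, iv)
        .doFinal(encrypted);
    System.out.println(new String(decrypted, StandardCharsets.UTF_8));
  }
}
```

In this sketch a reader would look up the key version stored with a segment and pass it to `newCipher`, so segments written under older key versions remain readable until they are upgraded.
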
>>>
>>> == Encryption Model
>>>
>>> The encryption model is based on AES/CBC with padding. The
>>> initialisation vector (IV) is reused for performance reasons, but only
>>> on a per format and per segment basis.
>>>
>>> While IV reuse is usually considered a bad practice, the CBC mode is
>>> somewhat resilient to IV reuse. The only "leak" of information that this
>>> could lead to is being able to know that two encrypted blocks of data
>>> start with the same prefix. However, it is unlikely that two data blocks
>>> in an index segment will start with the same data:
>>>
>>> - Stored Fields Format: Each encrypted data block is a compressed block
>>> (~4kb) of one or more documents. It is unlikely that two compressed
>>> blocks start with the same data prefix.
>>>
>>> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of
>>> terms and payloads from one or more documents. It is unlikely that two
>>> compressed blocks start with the same data prefix.
>>>
>>> - Term Dictionary Index: The term dictionary index is encoded and
>>> encrypted in one single data block.
>>>
>>> - Term Dictionary Data: Each data block of the term dictionary encodes a
>>> set of suffixes. It is unlikely to have two dictionary data blocks
>>> sharing the same prefix within the same segment.
>>>
>>> - DocValues: A DocValues file will be composed of multiple encrypted data
>>> blocks. It is unlikely to have two data blocks sharing the same prefix
>>> within the same segment (each one will encode a list of values
>>> associated with a field).
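
The prefix leak described above can be reproduced with the standard `javax.crypto` API. The following is a minimal sketch, not code from the patch: the class `CbcIvReuseDemo`, the all-zero demo key and the sample plaintexts are assumptions made for illustration. Two plaintexts that share their first 16 bytes are encrypted with the same key, once with a reused IV and once with distinct IVs, and the number of identical leading ciphertext blocks is counted.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo class, not part of the patch.
public class CbcIvReuseDemo {
  private static final int BLOCK = 16; // AES block size in bytes

  public static byte[] encrypt(byte[] key, byte[] iv, byte[] plaintext)
      throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
        new IvParameterSpec(iv));
    return cipher.doFinal(plaintext);
  }

  // Counts how many leading 16-byte blocks of the two ciphertexts are equal.
  public static int sharedLeadingBlocks(byte[] a, byte[] b) {
    int limit = Math.min(a.length, b.length) / BLOCK;
    int blocks = 0;
    while (blocks < limit
        && Arrays.equals(
            Arrays.copyOfRange(a, blocks * BLOCK, (blocks + 1) * BLOCK),
            Arrays.copyOfRange(b, blocks * BLOCK, (blocks + 1) * BLOCK))) {
      blocks++;
    }
    return blocks;
  }

  public static void main(String[] args) throws GeneralSecurityException {
    byte[] key = new byte[16];      // all-zero demo key, never use in production
    byte[] reusedIv = new byte[16]; // same IV for both blocks of data
    byte[] otherIv = new byte[16];
    otherIv[0] = 1;                 // a second, distinct IV

    // Same first 16 bytes, different remainder.
    byte[] p1 = "0123456789abcdefSECRET-ONE".getBytes(StandardCharsets.US_ASCII);
    byte[] p2 = "0123456789abcdefSECRET-TWO".getBytes(StandardCharsets.US_ASCII);

    int withReuse = sharedLeadingBlocks(
        encrypt(key, reusedIv, p1), encrypt(key, reusedIv, p2));
    int withoutReuse = sharedLeadingBlocks(
        encrypt(key, reusedIv, p1), encrypt(key, otherIv, p2));

    System.out.println("identical leading blocks, reused IV: " + withReuse);
    System.out.println("identical leading blocks, distinct IVs: " + withoutReuse);
  }
}
```

Because CBC chains each block on the previous ciphertext block, two ciphertexts produced with the same key and IV agree for exactly as long as the plaintexts agree on whole 16-byte blocks, and diverge from the first differing block onwards.
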
>>>
>>> To the best of our knowledge, this model should be safe. However, it
>>> would be good if someone with security expertise in the community could
>>> review and validate it.
>>>
>>> = Performance
>>>
>>> We report here a performance benchmark we did on an early prototype
>>> based on Lucene 4.x. The benchmark was performed on the Wikipedia
>>> dataset where all the fields (id, title, body, date) were encrypted.
>>> Only the block tree terms and compressed stored fields format were
>>> tested at that time.
>>>
>>> == Indexing
>>>
>>> The indexing throughput decreased by roughly 15% compared with base
>>> Lucene.
>>>
>>> The merge time increased by 35%.
>>>
>>> There was no significant difference in terms of index size.
>>>
>>> == Query Throughput
>>>
>>> With respect to query throughput, we observed no significant impact on
>>> the following queries: Term query, boolean query, phrase query, numeric
>>> range query.
>>>
>>> We observed the following performance impact for queries that need to
>>> scan a larger portion of the term dictionary:
>>>
>>> - prefix query: decrease of ~25%
>>> - wildcard query (e.g., “fu*r”): decrease of ~60%
>>> - fuzzy query (distance 1): decrease of ~40%
>>> - fuzzy query (distance 2): decrease of ~80%
>>>
>>> We can see that the decrease of performance is relative to the size of
>>> the dictionary scan.
>>>
>>> == Document Retrieval
>>>
>>> We observed a decrease of performance that is relative to the size of the
>>> set of documents to be retrieved:
>>>
>>> - ~20% when retrieving a medium set of documents (100)
>>> - ~30-40% when retrieving a large set of documents (1000)
>>>
>>> = Known Limitations
>>>
>>> - compressed stored fields do not keep the order of fields since
>>> non-encrypted and encrypted fields are stored in separate blocks.
>>>
>>> - the current implementation of the cipher factory does not enforce the
>>> use of AES/CBC. We are planning to add this to the final version of the
>>> patch.
>>>
>>> - the current implementation does not change the IV per segment. We are
>>> planning to add this to the final version of the patch.
>>>
>>> - the current implementation of compressed stored fields decrypts a full
>>> compressed block even if only a small portion is decompressed (high
>>> impact when storing very small documents). We are planning to add this
>>> optimisation to the final version of the patch. The overall document
>>> retrieval performance might increase with this optimisation.
>>>
>>> The codec has been implemented as a contrib. Given that most of the
>>> classes were final, we had to copy most of the original code from the
>>> extended formats. At a later stage, we could think of opening some of
>>> these classes to extend them properly in order to reduce code
>>> duplication and simplify code maintenance.
>>>
>>> --
>>> Renaud Delbru
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
