I would strongly recommend against inventing your own mode, and instead
using a standardized scheme/mode (e.g. XTS).
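
A minimal sketch of the kind of leak that IV reuse in CBC allows, using
only the JDK's javax.crypto. The all-zero key, the IVs, the plaintexts and
the class name CbcIvReuseDemo are made up for illustration and are not
taken from the proposed codec:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CbcIvReuseDemo {
    static final int BLOCK = 16; // AES block size in bytes

    // Encrypts plaintext with AES/CBC/PKCS5Padding under the given key and IV.
    static byte[] encrypt(byte[] key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher.doFinal(plaintext);
    }

    // Counts how many leading 16-byte blocks are identical in both ciphertexts.
    static int sharedLeadingBlocks(byte[] a, byte[] b) {
        int blocks = Math.min(a.length, b.length) / BLOCK;
        int n = 0;
        while (n < blocks
                && Arrays.equals(Arrays.copyOfRange(a, n * BLOCK, (n + 1) * BLOCK),
                                 Arrays.copyOfRange(b, n * BLOCK, (n + 1) * BLOCK))) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];   // demo key only (all zeros), never use in practice
        byte[] fixedIv = new byte[16];
        byte[] otherIv = new byte[16];
        otherIv[0] = 1;              // differs from fixedIv in one byte

        // Two plaintexts sharing a 32-byte (two-block) prefix.
        byte[] p1 = "0123456789abcdef0123456789abcdefAAAA".getBytes(StandardCharsets.US_ASCII);
        byte[] p2 = "0123456789abcdef0123456789abcdefBBBB".getBytes(StandardCharsets.US_ASCII);

        int reused = sharedLeadingBlocks(encrypt(key, fixedIv, p1), encrypt(key, fixedIv, p2));
        int distinct = sharedLeadingBlocks(encrypt(key, fixedIv, p1), encrypt(key, otherIv, p2));

        System.out.println("shared ciphertext blocks, same IV:      " + reused);
        System.out.println("shared ciphertext blocks, different IV: " + distinct);
    }
}
```

With the same IV, the two ciphertexts agree on both blocks covered by the
common plaintext prefix, so an observer learns that the prefixes match.
With different IVs they already diverge in the first block.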

Separate from that, I don't understand the reasoning for doing it at the
codec level. It seems quite a bit messier and more complicated than the
alternatives, such as the block device level (e.g. dm-crypt) or the
filesystem level (e.g. ext4 filesystem encryption), which have the
advantage of the filesystem cache actually working.


On Wed, Jan 6, 2016 at 4:19 AM, Renaud Delbru <[email protected]> wrote:
> Dear all,
>
> We would like to contribute a codec, developed as part of an engagement with
> a customer, that enables the encryption of sensitive data in the index. We
> think that this could be of interest to the community. If that is the case,
> I’ll open a JIRA ticket and upload a first version of the patch. We are also
> looking for feedback on the approach.
>
> Below is a description of the project.
>
> = Introduction
>
> In comparison with approaches where all data is encrypted (e.g., file system
> encryption, index output / directory encryption), encryption at the codec
> level enables more fine-grained control over which blocks of data are
> encrypted. This is more efficient since less data has to be encrypted. It
> also gives more flexibility, such as the ability to select which fields to
> encrypt.
>
> Some of the requirements for this project were:
>
> - The performance impact of the encryption should be reasonable.
> - The user can choose which field to encrypt.
> - Key management: During the life cycle of the index, the user can provide a
> new version of his encryption key. Multiple key versions should co-exist in
> one index.
>
> = What is supported?
>
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this
> will be submitted as a separate patch
> - Index upgrader: a command to upgrade all the index segments to the latest
> key version available.
>
> = How is it implemented?
>
> == Key Management
>
> One index segment is encrypted with a single key version. An index can have
> multiple segments, each one encrypted using a different key version. The key
> version for a segment is stored in the segment info.
>
> The provided codec is abstract, and a subclass is responsible for providing
> an implementation of the cipher factory. The cipher factory is responsible
> for creating a cipher instance based on a given key version.
>
> == Encryption Model
>
> The encryption model is based on AES/CBC with padding. The initialisation
> vector (IV) is reused for performance reasons, but only on a per-format and
> per-segment basis.
>
> While IV reuse is usually considered bad practice, the CBC mode is somewhat
> resilient to IV reuse. The only "leak" of information that this could lead
> to is being able to know that two encrypted blocks of data start with the
> same prefix. However, it is unlikely that two data blocks in an index
> segment will start with the same data:
>
> - Stored Fields Format: Each encrypted data block is a compressed block
> (~4kb) of one or more documents. It is unlikely that two compressed blocks
> start with the same data prefix.
>
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of
> terms and payloads from one or more documents. It is unlikely that two
> compressed blocks start with the same data prefix.
>
> - Term Dictionary Index: The term dictionary index is encoded and encrypted
> in one single data block.
>
> - Term Dictionary Data: Each data block of the term dictionary encodes a set
> of suffixes. It is unlikely to have two dictionary data blocks sharing the
> same prefix within the same segment.
>
> - DocValues: A DocValues file will be composed of multiple encrypted data
> blocks. It is unlikely to have two data blocks sharing the same prefix
> within the same segment (each one encodes a list of values associated
> with a field).
>
> To the best of our knowledge, this model should be safe. However, it would
> be good if someone with security expertise in the community could review and
> validate it.
>
> = Performance
>
> We report here a performance benchmark we did on an early prototype based on
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all
> the fields (id, title, body, date) were encrypted. Only the block tree terms
> and compressed stored fields format were tested at that time.
>
> == Indexing
>
> The indexing throughput decreased by roughly 15% compared with base
> Lucene.
>
> The merge time increased by 35%.
>
> There was no significant difference in terms of index size.
>
> == Query Throughput
>
> With respect to query throughput, we observed no significant impact on the
> following queries: Term query, boolean query, phrase query, numeric range
> query.
>
> We observed the following performance impact for queries that need to scan
> a larger portion of the term dictionary:
>
> - prefix query: decrease of ~25%
> - wildcard query (e.g., “fu*r”): decrease of ~60%
> - fuzzy query (distance 1): decrease of ~40%
> - fuzzy query (distance 2): decrease of ~80%
>
> We can see that the decrease in performance grows with the size of the
> dictionary scan.
>
> == Document Retrieval
>
> We observed a decrease in performance that grows with the size of the
> set of documents to be retrieved:
>
> - ~20% when retrieving a medium set of documents (100)
> - ~30-40% when retrieving a large set of documents (1000)
>
> = Known Limitations
>
> - compressed stored fields do not preserve the order of fields, since
> non-encrypted and encrypted fields are stored in separate blocks.
>
> - the current implementation of the cipher factory does not enforce the use
> of AES/CBC. We are planning to add this to the final version of the patch.
>
> - the current implementation does not change the IV per segment. We are
> planning to add this to the final version of the patch.
>
> - the current implementation of compressed stored fields decrypts a full
> compressed block even if only a small portion is decompressed (high impact
> when storing very small documents). We are planning to optimise this in the
> final version of the patch. The overall document retrieval performance
> might increase with this optimisation.
>
> The codec has been implemented as a contrib. Given that most of the classes
> were final, we had to copy most of the original code from the extended
> formats. At a later stage, we could think of opening some of these classes
> to extend them properly in order to reduce code duplication and simplify
> code maintenance.
>
> --
> Renaud Delbru
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
