Hi Robert,

Yes, you are right. This approach is more complex than plain fs-level encryption, but it enables more fine-grained control over what is encrypted. With fs-level encryption, it is not possible to choose which fields to encrypt: all the data is encrypted regardless of whether it is sensitive or not. For example, in such a scenario the full posting lists will be encrypted, which is unnecessary, and you will pay the cost of encrypting them. It is true that if the filesystem caches unencrypted pages, then with a warm cache you will likely get better performance. However, this also means that most of the index data will reside in memory in unencrypted form; if the server is compromised, this makes life easier for the attacker. There is also the (small) issue of swap, which can end up holding a large portion of the index unencrypted. This can be solved by using an encrypted swap, but then the data is encrypted with a single system-wide key rather than a per-user key, and it adds complexity to the management of the system. Highly sensitive installations can make this trade-off between performance and security. There are some applications of Solr that are not served by the other approaches.

This codec was developed in the context of a large multi-tenant architecture, where each user has their own index / collection. Each user has their own key and can update it at any time. While it seems it would be possible with ext4 to handle a per-user key (e.g., one key per directory), that makes key and index management more complex (especially in SolrCloud), which is not adequate for some environments. It also does not allow the management of multiple key versions in one index: if a user changes their key, we have to re-encrypt the full directory, which is not acceptable performance-wise in some environments.

For some environments, the codec-level encryption approach is more adequate than the fs-level encryption approach. It is also worth noting that this codec does not affect the rest of Lucene/Solr. Users will be able to choose whichever approach is more adequate for their environment. This gives more options to Lucene/Solr users.

P.S.: I have created the issue LUCENE-6966 and moved the discussion there, as it is simpler for external people to participate in the discussion.

Regards
--
Renaud Delbru

On 06/01/16 15:32, Robert Muir wrote:
I would strongly recommend against "invent your own mode", and instead
using standardized schemes/modes (e.g. XTS).

Separate from that, I don't understand the reasoning to do it at the
codec level. seems quite a bit more messy and complicated than the
alternatives, such as block device level (e.g. dm-crypt), or
filesystem level (e.g. ext4 filesystem encryption), which have the
advantage of the filesystem cache actually working.


On Wed, Jan 6, 2016 at 4:19 AM, Renaud Delbru <[email protected]> wrote:
Dear all,

We would like to contribute a codec that enables the encryption of sensitive
data in the index that has been developed as part of an engagement with a
customer. We think that this could be of interest for the community. If that
is the case, I’ll open a JIRA ticket and upload a first version of the
patch. We are also looking for feedback on the approach.

Below is a description of the project.

= Introduction

In comparison with approaches where all data is encrypted (e.g., file system
encryption, index output / directory encryption), encryption at the codec
level enables more fine-grained control over which blocks of data are
encrypted. This is more efficient since less data has to be encrypted. It
also gives more flexibility, such as the ability to select which fields to
encrypt.

Some of the requirements for this project were:

- The performance impact of the encryption should be reasonable.
- The user can choose which fields to encrypt.
- Key management: During the life cycle of the index, the user can provide a
new version of their encryption key. Multiple key versions should co-exist in
one index.

= What is supported?

- Block tree terms index and dictionary
- Compressed stored fields format
- Compressed term vectors format
- Doc values format (prototype based on an encrypted index output) - this
will be submitted as a separate patch
- Index upgrader: command to upgrade all the index segments with the latest
key version available.

= How is it implemented?

== Key Management

One index segment is encrypted with a single key version. An index can have
multiple segments, each one encrypted using a different key version. The key
version for a segment is stored in the segment info.

The provided codec is abstract, and a subclass is responsible for providing
an implementation of the cipher factory. The cipher factory is responsible
for creating a cipher instance for a given key version.
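The cipher factory described above could be sketched as follows. This is a minimal illustration, not the actual patch API: the class and method names (CipherFactory, keyForVersion, createCipher) are hypothetical, and only the "one cipher per key version, resolved by a subclass" contract from the description is assumed.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of the abstract cipher factory: the codec asks it
// for a cipher keyed by the key version recorded in the segment info.
abstract class CipherFactory {

    // Resolve the raw AES key material for a given key version. How keys
    // are stored (keystore, KMS, ...) is left to the concrete subclass.
    protected abstract byte[] keyForVersion(int keyVersion);

    // Create a cipher instance for the given key version, in either
    // Cipher.ENCRYPT_MODE or Cipher.DECRYPT_MODE, with the supplied IV.
    Cipher createCipher(int keyVersion, int mode, byte[] iv) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        SecretKeySpec key = new SecretKeySpec(keyForVersion(keyVersion), "AES");
        cipher.init(mode, key, new IvParameterSpec(iv));
        return cipher;
    }
}
```

A concrete subclass would only implement keyForVersion; the codec itself never sees raw key material beyond the cipher instances it requests.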

== Encryption Model

The encryption model is based on AES/CBC with padding. The initialisation
vector (IV) is reused for performance reasons, but only on a per-format and
per-segment basis.

While IV reuse is usually considered bad practice, the CBC mode is somewhat
resilient to it. The only "leak" of information this could lead to is
revealing that two encrypted blocks of data start with the same prefix.
However, it is unlikely that two data blocks in an index segment will start
with the same data:

- Stored Fields Format: Each encrypted data block is a compressed block
(~4kb) of one or more documents. It is unlikely that two compressed blocks
start with the same data prefix.

- Term Vectors: Each encrypted data block is a compressed block (~4kb) of
terms and payloads from one or more documents. It is unlikely that two
compressed blocks start with the same data prefix.

- Term Dictionary Index: The term dictionary index is encoded and encrypted
in a single data block.

- Term Dictionary Data: Each data block of the term dictionary encodes a set
of suffixes. It is unlikely that two dictionary data blocks share the same
prefix within the same segment.

- DocValues: A DocValues file is composed of multiple encrypted data blocks.
It is unlikely that two data blocks share the same prefix within the same
segment (each one encodes a list of values associated with a field).

To the best of our knowledge, this model should be safe. However, it would
be good if someone with security expertise in the community could review and
validate it.
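The "prefix leak" under IV reuse discussed above can be demonstrated directly: with AES/CBC, the same key, and the same IV, two plaintexts that agree on their first cipher block produce ciphertexts that agree on the corresponding block. This is a toy illustration (zero key and IV, made-up plaintexts), not code from the patch.

```java
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Demonstrates the information "leak" of CBC under IV reuse: identical
// plaintext prefixes yield identical ciphertext prefixes (per 16-byte
// AES block), and nothing beyond that is revealed.
public class IvReuseDemo {

    static byte[] encrypt(byte[] key, byte[] iv, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE,
               new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c.doFinal(plain);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16]; // toy key
        byte[] iv  = new byte[16]; // deliberately reused IV
        // Both plaintexts share their first 16 bytes ("block-0000-share").
        byte[] a = encrypt(key, iv, "block-0000-shared-prefix-AAAA".getBytes());
        byte[] b = encrypt(key, iv, "block-0000-shared-prefix-BBBB".getBytes());
        // First AES block of the ciphertexts is identical; later blocks differ.
        System.out.println(Arrays.equals(
            Arrays.copyOf(a, 16), Arrays.copyOf(b, 16))); // prints "true"
    }
}
```

This is why the argument above rests on index data blocks being unlikely to share a prefix within a segment: when prefixes differ, reusing the IV reveals nothing.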

= Performance

We report here a performance benchmark we did on an early prototype based on
Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all
the fields (id, title, body, date) were encrypted. Only the block tree terms
and compressed stored fields format were tested at that time.

== Indexing

The indexing throughput decreased slightly, roughly 15% below the Lucene
baseline.

The merge time increased by roughly 35%.

There was no significant difference in terms of index size.

== Query Throughput

With respect to query throughput, we observed no significant impact on the
following queries: Term query, boolean query, phrase query, numeric range
query.

We observed the following performance impact for queries that need to scan
a larger portion of the term dictionary:

- prefix query: decrease of ~25%
- wildcard query (e.g., “fu*r”): decrease of ~60%
- fuzzy query (distance 1): decrease of ~40%
- fuzzy query (distance 2): decrease of ~80%

We can see that the performance decrease is proportional to the size of the
dictionary scan.

== Document Retrieval

We observed a decrease in performance that is proportional to the size of
the set of documents to be retrieved:

- ~20% when retrieving a medium set of documents (100)
- ~30-40% when retrieving a large set of documents (1000)

= Known Limitations

- Compressed stored fields do not preserve the order of fields, since
non-encrypted and encrypted fields are stored in separate blocks.

- the current implementation of the cipher factory does not enforce the use
of AES/CBC. We are planning to add this to the final version of the patch.

- the current implementation does not change the IV per segment. We are
planning to add this to the final version of the patch.

- the current implementation of compressed stored fields decrypts a full
compressed block even if only a small portion of it needs to be decompressed
(high impact when storing very small documents). We are planning to add this
optimisation to the final version of the patch. The overall document
retrieval performance might improve with it.

The codec has been implemented as a contrib. Given that most of the classes
were final, we had to copy most of the original code from the extended
formats. At a later stage, we could think of opening some of these classes
to extend them properly in order to reduce code duplication and simplify
code maintenance.

--
Renaud Delbru


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
