Here's an old Lucene issue/patch for an AES-encrypted Lucene Directory
class that might give you some ideas:
https://issues.apache.org/jira/browse/LUCENE-2228

No idea what happened to it.

An even older issue attempting to add encryption for specific fields:
https://issues.apache.org/jira/browse/LUCENE-737
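
To make the prefix-search objection quoted below concrete (run* against
run, runner, running and runs), here is a minimal Python sketch. It is not
taken from either issue above. The key, the protect() helper and the
prefix_matches() helper are hypothetical names for illustration, and a
keyed HMAC-SHA256 stands in for a deterministic per-token cipher; a real
cipher would be reversible, but it would scramble prefixes in the same way.

```python
import hashlib
import hmac

# Hypothetical key, for illustration only.
KEY = b"demo-key-not-for-production"


def protect(token: str) -> str:
    """Stand-in for a deterministic per-token cipher (keyed HMAC-SHA256, hex)."""
    return hmac.new(KEY, token.encode("utf-8"), hashlib.sha256).hexdigest()


def prefix_matches(terms, prefix):
    """Roughly what a run* query does against a sorted term dictionary."""
    return [t for t in sorted(terms) if t.startswith(prefix)]


terms = ["run", "runner", "running", "runs", "walk"]

# Plaintext terms: all four variants share the prefix "run".
plain_hits = prefix_matches(terms, "run")

# Protected terms: only the exact term "run" still matches its own image.
protected_terms = [protect(t) for t in terms]
protected_hits = prefix_matches(protected_terms, protect("run"))

print(plain_hits)
print(len(protected_hits))
```

This is why the approaches discussed in this thread (an encrypted
Directory, a custom codec, or an encrypted filesystem) decrypt below the
search layer, so that the query code still sees plaintext terms.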

-- Jack Krupansky

On Tue, Sep 8, 2015 at 11:07 AM, Adam Retter <adam.ret...@googlemail.com>
wrote:

>
>> The easiest way to do this is put the index over
>> an encrypted file system. Encrypting the actual
>> _tokens_ has a few problems, not the least of
>> which is that any encryption algorithm worth
>> its salt is going to make most searching totally
>> impossible.
>>
>
> I already suggested an encrypted filesystem to the customer but
> unfortunately that was rejected.
>
>
>> Consider run, runner, running and runs with
>> simple wildcards. Searching for run* requires that all 4
>> variants have 'run' as a prefix, and any decent
>> encryption algorithm will not do that. Any
>> encryption that _does_ make that search possible
>> is trivially broken. I usually stop my thinking there,
>> but ngrams, casing, WordDelimiterFilterFactory
>> all come immediately to mind as "interesting".
>>
>
> I was rather hoping that I could do the encryption and subsequent
> decryption at a level below the search itself, so that when the query
> examines the data it sees the decrypted values, so that things like prefix
> scans etc. would indeed still work. Previously in this thread, Shawn
> suggested writing a custom codec; I wonder if that would enable querying?
>
>
>> But what about stored data you ask? Yes, the
>> stored fields are compressed but stored verbatim,
>> so I've seen arguments for encrypting _that_ stream,
>> but that's really a "feel good" fig-leaf. If I get access to the
>> index and it has position information, I can reconstruct
>> documents without the stored data as Luke does. The
>> process is a bit lossy, but the reconstructed document
>> has enough fidelity that it'll give people seriously
>> concerned about encryption conniption fits.
>>
>
> Exactly!
>
>
>>
>> So all in all I have to back up Shawn's comments: You're
>> better off isolating your Solr/Lucene system, putting
>> authorization to view _documents_ at that level, and possibly
>> using an encrypted filesystem.
>>
>> FWIW,
>> Erick
>>
>> On Sat, Sep 5, 2015 at 7:27 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>> > On 9/5/2015 5:06 AM, Adam Retter wrote:
>> >> I wondered if there is any facility already existing in Lucene for
>> >> encrypting the values stored into the index and still being able to
>> >> search them?
>> >>
>> >> If not, I wondered if anyone could tell me if this is impossible to
>> >> implement, and if not to point me perhaps in the right direction?
>> >>
>> >> I imagine that just the text values and document fields to index (and
>> >> optionally store) in the index would be either encrypted on the fly by
>> >> Lucene using perhaps a public/private key mechanism. When a user issues
>> >> a search query to Lucene they would also provide a key so that Lucene
>> >> can decrypt the values as necessary to try and answer their query.
>> >
>> > I think you could probably add transparent encryption/decryption at the
>> > Lucene level in a custom codec.  That probably has implications for
>> > being able to read the older index when it's time to upgrade Lucene,
>> > with a complete reindex being the likely solution.  Others will need to
>> > confirm ... I'm not very familiar with Lucene code, I'm here for Solr.
>> >
>> > Any verification of user identity/permission is probably best done in
>> > your own code, before it makes the Lucene query, and wouldn't
>> > necessarily be related to the encryption.
>> >
>> > Requirements like this are usually driven by paranoid customers or
>> > product managers.  I think that when you really start to examine what an
>> > attacker has to do to actually reach the unencrypted information (Lucene
>> > index in this case), they already have acquired so much access that the
>> > system is completely breached and it won't matter what kind of
>> > encryption is added.
>> >
>> > I find many of these requirements to be silly, and put an incredible
>> > burden on admin and developer resources with little or no benefit.
>> > Here's an example of a similar customer encryption requirement which I
>> > encountered recently:
>> >
>> > We have a web application that has three "hops" involved.  A user talks
>> > to a load balancer, which talks to Apache, where the connection is then
>> > proxied to a Tomcat server with the AJP protocol.  The customer wanted
>> > all three hops encrypted.  The first hop was already encrypted, the
>> > second was easy, but the third proved to be very difficult.  Finally we
>> > decided that we did not need load balancing on that last hop, and it
>> > could simply talk to localhost, eliminating the need to encrypt it.
>> >
>> > The customer was worried about an attacker sniffing the traffic on the
>> > LAN and seeing details like passwords.  I consider this to be an insane
>> > requirement.  In order to sniff that traffic, the attacker would need
>> > one of two things:  Root access on a server, or physical access to the
>> > infrastructure.  Physical access can be escalated to root access if you
>> > know what you're doing.  Once someone has either of those things,
>> > encrypted traffic won't matter, they will be able to learn anything they
>> > need or do any damage they desire, without even needing to sniff the
>> > traffic.
>> >
>> > Thanks,
>> > Shawn
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>
>
> --
> Adam Retter
>
> skype: adam.retter
> tweet: adamretter
> http://www.adamretter.org.uk
>
