The problem with encrypted file systems is that if someone gets access to
the file system (not the disk, the file system, e.g. via ssh), everything
on it is wide open to them. It's like my work laptop: the disk is
encrypted, but once I've entered my password, all files are readable to
me. Files that are individually password protected are not, and that's
what security experts want: even if an attacker stole the machine and has
all the passwords and all the time in the world, without the
public/private key of the encrypted index he won't be able to read it. I'm
not justifying it, just repeating what I was told, even though I think
it's silly. If someone managed to get hold of the machine, the login
password, and root access, what are the chances he doesn't already have
the other keys?

Anyway, we're here to solve the technical problem, and we obviously aren't
the ones making these decisions, and it's futile attempting to argue with
security folks, so let's address the question of how to achieve encryption.

I wouldn't go with a Codec, personally, to achieve encryption. It's
overcomplicated IMO. Rather, an encrypted Directory is a simpler solution.
You will need to implement an EncryptingIndexOutput and a matching
DecryptingIndexInput, but that's more or less it. The encryption/decryption
happens in buffers, so you will want to extend the respective BufferedIO
classes. The issues mentioned earlier in the thread (LUCENE-2228 and
LUCENE-737) should give you a head start. The patches are old and likely
don't compile against new versions, but they contain the gist of it.
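
To make the "encryption happens in buffers" idea concrete, here is a rough
sketch using only the JDK, not the Lucene API. The class name, the buffer
size, and the per-file nonce are my own assumptions for illustration. It
uses AES in CTR mode and derives each buffer's counter from the buffer
index, so any buffer can be decrypted on its own, which is what a
DecryptingIndexInput would need in order to seek. A real implementation
would wrap this in the Directory/IndexInput/IndexOutput classes and would
also have to deal with key management and integrity checking, which CTR
alone does not provide.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: encrypts a "file" buffer by buffer.
class EncryptedBufferSketch {
    static final int BUFFER_SIZE = 1024; // assumed size; must be a multiple of 16
    static final int AES_BLOCK = 16;

    // CTR is symmetric, so the same routine encrypts and decrypts one buffer.
    private static byte[] crypt(int mode, byte[] key, long fileNonce,
                                long bufferIndex, byte[] data) {
        // Counter block: 8-byte per-file nonce, then the index of the
        // buffer's first AES block within the file.
        byte[] iv = ByteBuffer.allocate(AES_BLOCK)
                .putLong(fileNonce)
                .putLong(bufferIndex * (BUFFER_SIZE / AES_BLOCK))
                .array();
        try {
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // What an encrypting output would do as each buffer is flushed.
    static byte[] encryptFile(byte[] key, long fileNonce, byte[] plain) {
        byte[] out = new byte[plain.length];
        for (int start = 0; start < plain.length; start += BUFFER_SIZE) {
            int end = Math.min(start + BUFFER_SIZE, plain.length);
            byte[] enc = crypt(Cipher.ENCRYPT_MODE, key, fileNonce,
                    start / BUFFER_SIZE, Arrays.copyOfRange(plain, start, end));
            System.arraycopy(enc, 0, out, start, enc.length);
        }
        return out;
    }

    // What a decrypting input would do after a seek: decrypt only one buffer.
    static byte[] readBuffer(byte[] key, long fileNonce, byte[] encrypted,
                             long bufferIndex) {
        int start = (int) (bufferIndex * BUFFER_SIZE);
        int end = Math.min(start + BUFFER_SIZE, encrypted.length);
        return crypt(Cipher.DECRYPT_MODE, key, fileNonce, bufferIndex,
                Arrays.copyOfRange(encrypted, start, end));
    }
}
```

The nonce must never be reused for two different files under the same key,
otherwise CTR leaks the XOR of the two plaintexts.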

Just make sure your application, or actually the process running Lucene,
receives the public/private key in a non-obvious way, so that if someone
does get hold of the machine, he can't obtain that information!

Also, as for encrypting the terms themselves, beyond the problems mentioned
earlier in the thread about wildcard queries, there is the risk of someone
guessing the terms based on their statistics. If the attacker knows the
corpus domain, I assume it shouldn't be hard for him to guess that a
certain word with a high DF (document frequency) and TF (term frequency)
is probably "the" and proceed from there.
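
A small sketch of that statistics leak, again plain JDK and not Lucene
code. The class and method names are made up, and HMAC-SHA256 is only my
stand-in for "some deterministic per-term encryption" (equal terms must map
to equal tokens, otherwise exact-match lookup wouldn't work). The point is
that document frequencies survive the transformation unchanged, so the
token with the highest DF is a good candidate for the most common word.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of frequency analysis on opaque terms.
class TermFrequencyLeakSketch {
    // Stand-in for deterministic term encryption: same term, same token.
    static String opaque(String term, byte[] key) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] tag = mac.doFinal(term.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(tag);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Document frequency per opaque token, i.e. what someone reading the
    // term dictionary could count without knowing the key.
    static Map<String, Integer> documentFrequencies(List<String> docs, byte[] key) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String term : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                seen.add(opaque(term, key));
            }
            for (String token : seen) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    // The attacker's first guess: the highest-DF token is the commonest word.
    static String mostFrequentToken(Map<String, Integer> df) {
        return Collections.max(df.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```

Randomizing the encryption per occurrence would hide these counts, but then
the terms could no longer be looked up at all.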

Again, I'm no security expert, and I've learned it's sometimes futile
trying to argue with them. If you can convince them, though, that the
system as a whole is protected enough, and that if it is breached an
encrypted index is likely breached too, you can avoid the complexity. In
my experience encryption hurts performance. You can improve that, e.g. by
buffering parts unencrypted, but then you also need to prove your
program's memory is protected...
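
One way to picture "buffering parts unencrypted" is a small LRU cache of
decrypted buffers, so that hot parts of the index are decrypted only once.
This is just a sketch with hypothetical names. The decrypt function is
passed in and the capacity is arbitrary. Whatever sits in this cache is
plaintext in the JVM heap, which is exactly the memory-protection question
raised above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical LRU cache of decrypted buffers, keyed by buffer index.
class DecryptedBufferCacheSketch {
    private final LongFunction<byte[]> decryptBuffer;
    private final LinkedHashMap<Long, byte[]> cache;
    int misses = 0; // number of times the decrypt function had to run

    DecryptedBufferCacheSketch(final int maxBuffers, LongFunction<byte[]> decryptBuffer) {
        this.decryptBuffer = decryptBuffer;
        // accessOrder = true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxBuffers; // evict the least recently used
            }
        };
    }

    // Returns the plaintext buffer, decrypting only on a cache miss.
    byte[] get(long bufferIndex) {
        byte[] plain = cache.get(bufferIndex);
        if (plain == null) {
            misses++;
            plain = decryptBuffer.apply(bufferIndex);
            cache.put(bufferIndex, plain);
        }
        return plain;
    }
}
```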

Hope this helps.

Shai
On Sep 8, 2015 8:18 PM, "Erick Erickson" <[email protected]> wrote:

> Adam:
>
> Yeah, I've seen client requirements that cause me to scratch
> my head. I suppose, though, some argument can be made
> that having a separate encrypting key for the index itself that's
> completely separate from any more widely-known encryption
> key for a disk is a valid argument. You could even have different
> encryption keys for, say, each user's index or something.
>
> bq: I was rather hoping that I could do the encryption and subsequent
> decryption at a level below the search itself
>
> Aside from the different encryption key per index (or whatever), why
> does the client think this is any more secure than an encrypted disk?
>
> Just askin'....
>
> Erick
>
> On Tue, Sep 8, 2015 at 8:21 AM, Jack Krupansky <[email protected]>
> wrote:
> > Here's an old Lucene issue/patch for an AES encrypted Lucene
> > directory class that might give you some ideas:
> > https://issues.apache.org/jira/browse/LUCENE-2228
> >
> > No idea what happened to it.
> >
> > An even older issue attempting to add encryption for specific fields:
> > https://issues.apache.org/jira/browse/LUCENE-737
> >
> > -- Jack Krupansky
> >
> > On Tue, Sep 8, 2015 at 11:07 AM, Adam Retter <[email protected]>
> > wrote:
> >>
> >>
> >>> The easiest way to do this is put the index over
> >>> an encrypted file system. Encrypting the actual
> >>> _tokens_ has a few problems, not the least of
> >>> which is that any encryption algorithm worth
> >>> its salt is going to make most searching totally
> >>> impossible.
> >>
> >>
> >> I already suggested an encrypted filesystem to the customer but
> >> unfortunately that was rejected.
> >>
> >>
> >>> Consider run, runner, running and runs with
> >>> simple wildcards. Searching for run* requires that all 4
> >>> variants have 'run' as a prefix, and any decent
> >>> encryption algorithm will not do that. Any
> >>> encryption that _does_ make that search possible
> >>> is trivially broken. I usually stop my thinking there,
> >>> but ngrams, casing, WordDelimiterFilterFactory
> >>> all come immediately to mind as "interesting".
> >>
> >>
> >> I was rather hoping that I could do the encryption and subsequent
> >> decryption at a level below the search itself, so that when the query
> >> examines the data it sees the decrypted values so that things like
> >> prefix scans etc would indeed still work. Previously in this thread,
> >> Shawn suggested writing a custom codec, I wonder if that would enable
> >> querying?
> >>
> >>>
> >>> But what about stored data you ask? Yes, the
> >>> stored fields are compressed but stored verbatim,
> >>> so I've seen arguments for encrypting _that_ stream,
> >>> but that's really a "feel good" fig-leaf. If I get access to the
> >>> index and it has position information, I can reconstruct
> >>> documents without the stored data as Luke does. The
> >>> process is a bit lossy, but the reconstructed document
> >>> has enough fidelity that it'll give people seriously
> >>> concerned about encryption conniption fits.
> >>
> >>
> >> Exactly!
> >>
> >>>
> >>>
> >>> So all in all I have to back up Shawn's comments: You're
> >>> better off isolating your Solr/Lucene system, putting
> >>> authorization to view _documents_ at that level, and possibly
> >>> using an encrypted filesystem.
> >>>
> >>> FWIW,
> >>> Erick
> >>>
> >>> On Sat, Sep 5, 2015 at 7:27 AM, Shawn Heisey <[email protected]>
> >>> wrote:
> >>> > On 9/5/2015 5:06 AM, Adam Retter wrote:
> >>> >> I wondered if there is any facility already existing in Lucene for
> >>> >> encrypting the values stored into the index and still being able to
> >>> >> search them?
> >>> >>
> >>> >> If not, I wondered if anyone could tell me if this is impossible to
> >>> >> implement, and if not to point me perhaps in the right direction?
> >>> >>
> >>> >> I imagine that just the text values and document fields to index
> >>> >> (and optionally store) in the index would be either encrypted on
> >>> >> the fly by Lucene using perhaps a public/private key mechanism.
> >>> >> When a user issues a search query to Lucene they would also
> >>> >> provide a key so that Lucene can decrypt the values as necessary
> >>> >> to try and answer their query.
> >>> >
> >>> > I think you could probably add transparent encryption/decryption
> >>> > at the Lucene level in a custom codec.  That probably has
> >>> > implications for being able to read the older index when it's time
> >>> > to upgrade Lucene, with a complete reindex being the likely
> >>> > solution.  Others will need to confirm ... I'm not very familiar
> >>> > with Lucene code, I'm here for Solr.
> >>> >
> >>> > Any verification of user identity/permission is probably best done in
> >>> > your own code, before it makes the Lucene query, and wouldn't
> >>> > necessarily be related to the encryption.
> >>> >
> >>> > Requirements like this are usually driven by paranoid customers or
> >>> > product managers.  I think that when you really start to examine
> >>> > what an attacker has to do to actually reach the unencrypted
> >>> > information (Lucene index in this case), they already have acquired
> >>> > so much access that the system is completely breached and it won't
> >>> > matter what kind of encryption is added.
> >>> >
> >>> > I find many of these requirements to be silly, and put an incredible
> >>> > burden on admin and developer resources with little or no benefit.
> >>> > Here's an example of similar customer encryption requirement which I
> >>> > encountered recently:
> >>> >
> >>> > We have a web application that has three "hops" involved.  A user
> >>> > talks to a load balancer, which talks to Apache, where the
> >>> > connection is then proxied to a Tomcat server with the AJP
> >>> > protocol.  The customer wanted all three hops encrypted.  The first
> >>> > hop was already encrypted, the second was easy, but the third
> >>> > proved to be very difficult.  Finally we decided that we did not
> >>> > need load balancing on that last hop, and it could simply talk to
> >>> > localhost, eliminating the need to encrypt it.
> >>> >
> >>> > The customer was worried about an attacker sniffing the traffic on
> >>> > the LAN and seeing details like passwords.  I consider this to be
> >>> > an insane requirement.  In order to sniff that traffic, the
> >>> > attacker would need one of two things:  Root access on a server, or
> >>> > physical access to the infrastructure.  Physical access can be
> >>> > escalated to root access if you know what you're doing.  Once
> >>> > someone has either of those things, encrypted traffic won't matter,
> >>> > they will be able to learn anything they need or do any damage they
> >>> > desire, without even needing to sniff the traffic.
> >>> >
> >>> > Thanks,
> >>> > Shawn
> >>> >
> >>> >
> >>> > ---------------------------------------------------------------------
> >>> > To unsubscribe, e-mail: [email protected]
> >>> > For additional commands, e-mail: [email protected]
> >>> >
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> Adam Retter
> >>
> >> skype: adam.retter
> >> tweet: adamretter
> >> http://www.adamretter.org.uk
> >
> >
>
>
>
