Here's an old Lucene issue/patch for an AES encrypted Lucene directory class that might give you some ideas: https://issues.apache.org/jira/browse/LUCENE-2228
No idea what happened to it.

An even older issue attempting to add encryption for specific fields:
https://issues.apache.org/jira/browse/LUCENE-737

-- Jack Krupansky

On Tue, Sep 8, 2015 at 11:07 AM, Adam Retter <adam.ret...@googlemail.com> wrote:

>> The easiest way to do this is put the index over
>> an encrypted file system. Encrypting the actual
>> _tokens_ has a few problems, not the least of
>> which is that any encryption algorithm worth
>> its salt is going to make most searching totally
>> impossible.
>
> I already suggested an encrypted filesystem to the customer but
> unfortunately that was rejected.
>
>> Consider run, runner, running and runs with
>> simple wildcards. Searching for run* requires that all 4
>> variants have 'run' as a prefix, and any decent
>> encryption algorithm will not do that. Any
>> encryption that _does_ make that search possible
>> is trivially broken. I usually stop my thinking there,
>> but ngrams, casing, WordDelimiterFilterFactory
>> all come immediately to mind as "interesting".
>
> I was rather hoping that I could do the encryption and subsequent
> decryption at a level below the search itself, so that when the query
> examines the data it sees the decrypted values so that things like prefix
> scans etc would indeed still work. Previously in this thread, Shawn
> suggested writing a custom codec, I wonder if that would enable querying?
>
>> But what about stored data you ask? Yes, the
>> stored fields are compressed but stored verbatim,
>> so I've seen arguments for encrypting _that_ stream,
>> but that's really a "feel good" fig-leaf. If I get access to the
>> index and it has position information, I can reconstruct
>> documents without the stored data as Luke does. The
>> process is a bit lossy, but the reconstructed document
>> has enough fidelity that it'll give people seriously
>> concerned about encryption conniption fits.
>
> Exactly!
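To make Erick's point about reconstruction concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's API or on-disk format: the class `PositionReconstructionDemo` and its methods are hypothetical names, and the "analyzer" is assumed to be nothing more than lowercasing plus splitting on non-word characters. It only shows why a term-to-positions mapping is enough to read a document back in order, and why the result is lossy (case and punctuation are gone).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: this does not use Lucene's API or file formats.
final class PositionReconstructionDemo {

    /** Builds term -> positions for one document, as a positional index would record it. */
    static Map<String, List<Integer>> indexPositions(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        int position = 0;
        // Assumed analysis chain: lowercase, then split on runs of non-word characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with punctuation
            }
            postings.computeIfAbsent(token, t -> new ArrayList<>()).add(position++);
        }
        return postings;
    }

    /** Inverts the postings to position -> term and reads the terms back in position order. */
    static String reconstruct(Map<String, List<Integer>> postings) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        postings.forEach((term, positions) ->
                positions.forEach(p -> byPosition.put(p, term)));
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings =
                indexPositions("The runner runs; the runner is running.");
        System.out.println(postings);
        // prints "the runner runs the runner is running"
        System.out.println(reconstruct(postings));
    }
}
```

The stored field is never consulted here. Anything a real analysis chain discards (stop words, stemmed endings, original casing) would be missing from the output in the same way, which is the "bit lossy" part.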
>
>> So all in all I have to back up Shawn's comments: You're
>> better off isolating your Solr/Lucene system, putting
>> authorization to view _documents_ at that level, and possibly
>> using an encrypted filesystem.
>>
>> FWIW,
>> Erick
>>
>> On Sat, Sep 5, 2015 at 7:27 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>> > On 9/5/2015 5:06 AM, Adam Retter wrote:
>> >> I wondered if there is any facility already existing in Lucene for
>> >> encrypting the values stored into the index and still being able to
>> >> search them?
>> >>
>> >> If not, I wondered if anyone could tell me if this is impossible to
>> >> implement, and if not to point me perhaps in the right direction?
>> >>
>> >> I imagine that just the text values and document fields to index (and
>> >> optionally store) in the index would be either encrypted on the fly by
>> >> Lucene using perhaps a public/private key mechanism. When a user issues
>> >> a search query to Lucene they would also provide a key so that Lucene
>> >> can decrypt the values as necessary to try and answer their query.
>> >
>> > I think you could probably add transparent encryption/decryption at the
>> > Lucene level in a custom codec. That probably has implications for
>> > being able to read the older index when it's time to upgrade Lucene,
>> > with a complete reindex being the likely solution. Others will need to
>> > confirm ... I'm not very familiar with Lucene code, I'm here for Solr.
>> >
>> > Any verification of user identity/permission is probably best done in
>> > your own code, before it makes the Lucene query, and wouldn't
>> > necessarily be related to the encryption.
>> >
>> > Requirements like this are usually driven by paranoid customers or
>> > product managers. I think that when you really start to examine what an
>> > attacker has to do to actually reach the unencrypted information (Lucene
>> > index in this case), they already have acquired so much access that the
>> > system is completely breached and it won't matter what kind of
>> > encryption is added.
>> >
>> > I find many of these requirements to be silly, and put an incredible
>> > burden on admin and developer resources with little or no benefit.
>> > Here's an example of similar customer encryption requirement which I
>> > encountered recently:
>> >
>> > We have a web application that has three "hops" involved. A user talks
>> > to a load balancer, which talks to Apache, where the connection is then
>> > proxied to a Tomcat server with the AJP protocol. The customer wanted
>> > all three hops encrypted. The first hop was already encrypted, the
>> > second was easy, but the third proved to be very difficult. Finally we
>> > decided that we did not need load balancing on that last hop, and it
>> > could simply talk to localhost, eliminating the need to encrypt it.
>> >
>> > The customer was worried about an attacker sniffing the traffic on the
>> > LAN and seeing details like passwords. I consider this to be an insane
>> > requirement. In order to sniff that traffic, the attacker would need
>> > one of two things: Root access on a server, or physical access to the
>> > infrastructure. Physical access can be escalated to root access if you
>> > know what you're doing. Once someone has either of those things,
>> > encrypted traffic won't matter, they will be able to learn anything they
>> > need or do any damage they desire, without even needing to sniff the
>> > traffic.
>> >
>> > Thanks,
>> > Shawn
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>
> --
> Adam Retter
>
> skype: adam.retter
> tweet: adamretter
> http://www.adamretter.org.uk
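
For anyone who wants to check Erick's run* argument from earlier in the thread, the following is a rough sketch using only the JDK's `javax.crypto`. Everything in it is an assumption made for illustration: the class name `TokenEncryptionDemo`, the hard-coded demo key, and the choice of modes. AES/ECB stands in for "encrypt each token" only because it is deterministic and keeps the demo short (a properly randomized mode would break even exact-match lookups), and AES/CTR with a fixed all-zero IV stands in for a scheme that does keep prefixes intact.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class TokenEncryptionDemo {
    // Hard-coded 128-bit demo key; never do this outside a throwaway experiment.
    private static final SecretKeySpec KEY =
            new SecretKeySpec("0123456789abcdef".getBytes(StandardCharsets.US_ASCII), "AES");

    /** Encrypts a token as padded AES blocks (ECB chosen only because it is deterministic). */
    static byte[] encryptBlock(String token) {
        return run("AES/ECB/PKCS5Padding", null, token);
    }

    /** Deliberately broken: AES-CTR with a fixed all-zero IV, so every token reuses one keystream. */
    static byte[] encryptFixedKeystream(String token) {
        return run("AES/CTR/NoPadding", new IvParameterSpec(new byte[16]), token);
    }

    private static byte[] run(String transformation, IvParameterSpec iv, String token) {
        try {
            Cipher cipher = Cipher.getInstance(transformation);
            if (iv == null) {
                cipher.init(Cipher.ENCRYPT_MODE, KEY);
            } else {
                cipher.init(Cipher.ENCRYPT_MODE, KEY, iv);
            }
            return cipher.doFinal(token.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if every byte of prefix matches the start of full. */
    static boolean isPrefix(byte[] prefix, byte[] full) {
        return full.length >= prefix.length
                && Arrays.equals(prefix, Arrays.copyOf(full, prefix.length));
    }

    public static void main(String[] args) {
        // Each line prints block=false fixedKeystream=true
        for (String token : new String[] {"runner", "running", "runs"}) {
            System.out.println(token
                    + " block=" + isPrefix(encryptBlock("run"), encryptBlock(token))
                    + " fixedKeystream="
                    + isPrefix(encryptFixedKeystream("run"), encryptFixedKeystream(token)));
        }
    }
}
```

With block encryption the ciphertext of "run" is not a prefix of the ciphertext of "runner", so a prefix scan over encrypted terms finds nothing. The fixed-keystream variant keeps the prefix, but only because every token is XORed with the same bytes, so XORing any two ciphertexts yields the XOR of their plaintexts, which is the kind of "trivially broken" scheme Erick describes.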