Alternatively, do not store values in the Solr fields. Return a key and fetch encrypted data from a database or other repository.
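
A minimal sketch of that approach, under stated assumptions: the index is a plain dict standing in for a Solr query that returns only the id field (for example with fl=id), the repository is an in-memory SQLite table, and the cipher is passed in by the caller. EncryptedStore, search_ids and fetch_results are hypothetical names, not part of Solr or Lucene.

```python
import sqlite3
from typing import Callable, Dict, List, Set

# Any bytes -> bytes transform; in practice a real authenticated cipher.
Cipher = Callable[[bytes], bytes]


class EncryptedStore:
    """Holds ciphertext keyed by document id; the search index never stores it."""

    def __init__(self, encrypt: Cipher, decrypt: Cipher) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE docs (id TEXT PRIMARY KEY, body BLOB NOT NULL)"
        )
        self._encrypt = encrypt
        self._decrypt = decrypt

    def put(self, doc_id: str, plaintext: str) -> None:
        # Encrypt before the value ever reaches the repository.
        blob = self._encrypt(plaintext.encode("utf-8"))
        self._db.execute(
            "INSERT OR REPLACE INTO docs (id, body) VALUES (?, ?)", (doc_id, blob)
        )

    def ciphertext(self, doc_id: str) -> bytes:
        # What someone reading the repository directly would see.
        row = self._db.execute(
            "SELECT body FROM docs WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            raise KeyError(doc_id)
        return bytes(row[0])

    def get(self, doc_id: str) -> str:
        return self._decrypt(self.ciphertext(doc_id)).decode("utf-8")


def search_ids(index: Dict[str, Set[str]], term: str) -> List[str]:
    """Stand-in for a search that returns document ids only, no stored values."""
    return sorted(index.get(term.lower(), set()))


def fetch_results(
    store: EncryptedStore, index: Dict[str, Set[str]], term: str
) -> Dict[str, str]:
    """Search first, then fetch and decrypt each hit from the repository."""
    return {doc_id: store.get(doc_id) for doc_id in search_ids(index, term)}
```

This only keeps the stored values out of the index. The indexed terms are still there in the clear, which is the limitation Erick describes below.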

wunder
Walter Underwood
[email protected]
http://observer.wunderwood.org/ (my blog)

On Sep 5, 2015, at 9:40 AM, Erick Erickson <[email protected]> wrote:

> The easiest way to do this is put the index over
> an encrypted file system. Encrypting the actual
> _tokens_ has a few problems, not the least of
> which is that any encryption algorithm worth
> its salt is going to make most searching totally
> impossible.
>
> Consider run, runner, running and runs with
> simple wildcards. Searching for run* requires that all 4
> variants have 'run' as a prefix, and any decent
> encryption algorithm will not do that. Any
> encryption that _does_ make that search possible
> is trivially broken. I usually stop my thinking there,
> but ngrams, casing, WordDelimiterFilterFactory
> all come immediately to mind as "interesting".
>
> But what about stored data you ask? Yes, the
> stored fields are compressed but stored verbatim,
> so I've seen arguments for encrypting _that_ stream,
> but that's really a "feel good" fig-leaf. If I get access to the
> index and it has position information, I can reconstruct
> documents without the stored data as Luke does. The
> process is a bit lossy, but the reconstructed document
> has enough fidelity that it'll give people seriously
> concerned about encryption conniption fits.
>
> So all in all I have to back up Shawn's comments: You're
> better off isolating your Solr/Lucene system, putting
> authorization to view _documents_ at that level, and possibly
> using an encrypted filesystem.
>
> FWIW,
> Erick
>
> On Sat, Sep 5, 2015 at 7:27 AM, Shawn Heisey <[email protected]> wrote:
>> On 9/5/2015 5:06 AM, Adam Retter wrote:
>>> I wondered if there is any facility already existing in Lucene for
>>> encrypting the values stored into the index and still being able to
>>> search them?
>>>
>>> If not, I wondered if anyone could tell me if this is impossible to
>>> implement, and if not to point me perhaps in the right direction?
>>>
>>> I imagine that just the text values and document fields to index (and
>>> optionally store) in the index would be either encrypted on the fly by
>>> Lucene using perhaps a public/private key mechanism. When a user issues
>>> a search query to Lucene they would also provide a key so that Lucene
>>> can decrypt the values as necessary to try and answer their query.
>>
>> I think you could probably add transparent encryption/decryption at the
>> Lucene level in a custom codec. That probably has implications for
>> being able to read the older index when it's time to upgrade Lucene,
>> with a complete reindex being the likely solution. Others will need to
>> confirm ... I'm not very familiar with Lucene code, I'm here for Solr.
>>
>> Any verification of user identity/permission is probably best done in
>> your own code, before it makes the Lucene query, and wouldn't
>> necessarily be related to the encryption.
>>
>> Requirements like this are usually driven by paranoid customers or
>> product managers. I think that when you really start to examine what an
>> attacker has to do to actually reach the unencrypted information (Lucene
>> index in this case), they already have acquired so much access that the
>> system is completely breached and it won't matter what kind of
>> encryption is added.
>>
>> I find many of these requirements to be silly, and put an incredible
>> burden on admin and developer resources with little or no benefit.
>> Here's an example of similar customer encryption requirement which I
>> encountered recently:
>>
>> We have a web application that has three "hops" involved. A user talks
>> to a load balancer, which talks to Apache, where the connection is then
>> proxied to a Tomcat server with the AJP protocol. The customer wanted
>> all three hops encrypted. The first hop was already encrypted, the
>> second was easy, but the third proved to be very difficult. Finally we
>> decided that we did not need load balancing on that last hop, and it
>> could simply talk to localhost, eliminating the need to encrypt it.
>>
>> The customer was worried about an attacker sniffing the traffic on the
>> LAN and seeing details like passwords. I consider this to be an insane
>> requirement. In order to sniff that traffic, the attacker would need
>> one of two things: Root access on a server, or physical access to the
>> infrastructure. Physical access can be escalated to root access if you
>> know what you're doing. Once someone has either of those things,
>> encrypted traffic won't matter, they will be able to learn anything they
>> need or do any damage they desire, without even needing to sniff the
>> traffic.
>>
>> Thanks,
>> Shawn
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
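
Erick's run* argument quoted above can be checked with a small sketch. Python's standard library has no cipher, so this swaps in HMAC-SHA256 as a keyed, deterministic stand-in for token encryption. It is a keyed hash rather than encryption, but it shares the property that matters here: related inputs produce unrelated outputs. KEY, protect, exact_match and prefix_matches are hypothetical names for illustration only.

```python
import hashlib
import hmac
from typing import Dict, List

# Hypothetical demo key; a real deployment would manage keys properly.
KEY = b"demo-key-not-for-production"


def protect(token: str) -> str:
    """Keyed, deterministic transform of one token (stand-in for encryption)."""
    return hmac.new(KEY, token.encode("utf-8"), hashlib.sha256).hexdigest()


# The four variants from the thread, as they would sit in a protected index.
TERMS: List[str] = ["run", "runner", "running", "runs"]
PROTECTED: Dict[str, str] = {term: protect(term) for term in TERMS}


def exact_match(query: str) -> bool:
    """Exact lookup still works: protect the query and compare whole values."""
    return protect(query) in PROTECTED.values()


def prefix_matches(prefix: str) -> List[str]:
    """Attempt run*: find protected terms that begin with the protected prefix."""
    protected_prefix = protect(prefix)
    return [
        term
        for term, value in PROTECTED.items()
        if value.startswith(protected_prefix)
    ]


if __name__ == "__main__":
    print(exact_match("running"))   # exact term lookup
    print(prefix_matches("run"))    # wildcard attempt on protected terms
```

In plaintext all four variants start with "run", so run* matches all of them. On the protected values the only hit is the term "run" itself, because the transform does not carry the shared prefix through.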
