Chris Hostetter wrote:
Compression of stored fields is a feature that the Lucene "core" currently
supports out of the box -- but it does so in a very limited manner that
doesn't allow for much configuration. There is no advantage for users in
using compressed fields over compressing the data themselves before adding
it to the index, only disadvantages: notably the limited control the user
has over the compression, and added complexity for the code path executed
by all users -- even if they don't use compression (a boolean test on
"compressed" in FieldsReader may be fast ... but it's still a bytecode op
for every field that's completely unnecessary for a large portion of the
user base).
If the code was not already in the core, and someone asked about adding it,
I would argue against doing so on the grounds that some helpful utility
methods (possibly in a contrib) would be just as useful, and would have
no performance cost for people who don't care about compression.
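
For illustration, a helper utility of the kind described above might look
roughly like the sketch below. It uses only java.util.zip from the JDK. The
class name StoredFieldCompression and its methods are hypothetical and are
not part of Lucene; the assumption is that the application compresses the
value itself, stores the resulting bytes as a binary stored field, and
decompresses them after retrieval.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper (not a Lucene class): compress a value in application
// code before storing it, and decompress it after reading it back.
final class StoredFieldCompression {

    // The caller picks the level (0-9), for example Deflater.BEST_SPEED or
    // Deflater.BEST_COMPRESSION.
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read: the input is incomplete.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported compressed data");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("a fairly repetitive stored field value. ");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);

        byte[] packed = compress(original, Deflater.BEST_COMPRESSION);
        byte[] restored = decompress(packed);

        System.out.println("original bytes:   " + original.length);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok:    " + Arrays.equals(original, restored));
    }
}
```

With this approach the compression level (or a different algorithm entirely)
is the application's choice, and users who never compress pay nothing.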
Perhaps, if you look at compression on its own. But once you see compression
in the context of all the other field options, it makes sense to have it
in Lucene. In my opinion, the ease of implementation that comes from having
everything in one place offsets the performance issue.
First off, if all we are interested in is encrypting *stored* data,
then the issue becomes exactly the same as compression: there is no point
in putting this functionality in the "core" Lucene code base when it can
be done using helper utility methods -- now that that's out of the way,
let's talk about the good stuff...
As above
If we want to encrypt the text portion of Terms that are indexed for a
specific set of fields, this is again something that can easily be done
without modifying the "core" Lucene code base -- utility methods can be
used to help people encrypt UN_TOKENIZED Field values, and a simple
AnalyzerWrapper can be made to encrypt the text portion of Tokens produced
by another analyzer, both when indexing Field values and when QueryParser
is analyzing input text if necessary.
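
As a rough sketch of the utility-method idea, the class below encrypts a
single term deterministically, so that the same plaintext always maps to
the same ciphertext and an exact-match lookup can still find it. The class
TermCipher, its method names, the choice of AES in ECB mode, and the demo
key are all assumptions made for illustration; none of this is Lucene API,
and the wiring into Field values or an analyzer wrapper is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical utility (not a Lucene class). Encryption is deterministic on
// purpose: equal terms must produce equal ciphertext, otherwise an exact
// term lookup could never match what was indexed.
final class TermCipher {
    private final SecretKeySpec key;

    // rawKey: 16 bytes for AES-128.
    TermCipher(byte[] rawKey) {
        this.key = new SecretKeySpec(rawKey, "AES");
    }

    // Apply to an UN_TOKENIZED field value before indexing, and to the
    // query term before searching.
    String encryptTerm(String term) {
        byte[] out = run(Cipher.ENCRYPT_MODE, term.getBytes(StandardCharsets.UTF_8));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(out);
    }

    String decryptTerm(String encrypted) {
        byte[] in = Base64.getUrlDecoder().decode(encrypted);
        return new String(run(Cipher.DECRYPT_MODE, in), StandardCharsets.UTF_8);
    }

    private byte[] run(int mode, byte[] data) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(mode, key);
            return cipher.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES operation failed", e);
        }
    }

    public static void main(String[] args) {
        // Demo key only; a real application needs proper key management.
        TermCipher cipher = new TermCipher("0123456789abcdef".getBytes(StandardCharsets.UTF_8));
        String indexed = cipher.encryptTerm("4111-1111-1111-1111");
        String queried = cipher.encryptTerm("4111-1111-1111-1111");
        System.out.println("same ciphertext: " + indexed.equals(queried));
        System.out.println("decrypted:       " + cipher.decryptTerm(indexed));
    }
}
```

The same determinism that makes exact matching possible also means that
equal terms stay recognisably equal in the index.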
I take your word for it, but wouldn't you agree that replacing all the above
with just one line, "Field.Store.Encrypted" (or Field.Store.Encrypt, for
compatibility with Field.Store.Compress), would be a lot easier to use for
the average developer?
As others have already pointed out: encrypting just the Term text doesn't
do much to aid the overall security of your data -- because a bad guy with
access to your index can use the various statistics about your terms
(docFreq, term vectors, term positions, etc...) to aid them in cracking
your encryption -- maybe a user is okay with that risk, in which case my
previous comment about how this can easily be done without modifying any
core Lucene classes still holds. What about users who don't think this is
an acceptable risk? ... a more robust encryption mechanism is necessary...
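
The statistics concern can be made concrete with a small sketch. It
assumes, purely for illustration, that term text is encrypted
deterministically with AES in ECB mode; the class TermFrequencyLeak and
its methods are hypothetical and not Lucene code. The point it shows is
that the occurrence counts of the encrypted terms are identical to those
of the plaintext terms, which is the kind of information a frequency
analysis would start from.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo (not Lucene code): term-level encryption hides the text
// of each term but not how often each term occurs.
final class TermFrequencyLeak {

    // Deterministic stand-in for "encrypt just the Term text" (assumption: AES/ECB).
    static String encrypt(String term, byte[] rawKey) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(rawKey, "AES"));
            byte[] out = cipher.doFinal(term.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES operation failed", e);
        }
    }

    static List<String> encryptAll(List<String> terms, byte[] rawKey) {
        List<String> result = new ArrayList<>();
        for (String term : terms) {
            result.add(encrypt(term, rawKey));
        }
        return result;
    }

    // Occurrence counts in ascending order: visible without knowing any term text.
    static List<Integer> sortedCounts(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
        }
        List<Integer> result = new ArrayList<>(counts.values());
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        byte[] demoKey = "0123456789abcdef".getBytes(StandardCharsets.UTF_8); // demo key only
        List<String> plain = Arrays.asList("smith", "jones", "smith", "smith", "lee", "jones");
        List<String> encrypted = encryptAll(plain, demoKey);

        // Both lines print [1, 2, 3]: the frequency profile survives encryption.
        System.out.println("plaintext counts: " + sortedCounts(plain));
        System.out.println("encrypted counts: " + sortedCounts(encrypted));
    }
}
```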
Security is a big topic, we cannot hope to discuss it here. I am talking
about some form of data protection, not security.
When you say "a bad guy with access to your index", you imply that nothing
can be done to protect the index. But accessing an index which you are
determined to protect would not be easy; it would require expertise and
money, and carry the risk of a potential jail sentence. If you have national
security in mind, be assured no agency responsible for national security will
use open source software which is not certified, and that is downloaded from
an insecure site over the internet, in order to protect the nation (I hope!).
If we are talking about applications which need to protect data from curious
or even ill-intentioned eyes, then you can provide a deterrent by encrypting
that sensitive data only. It might be a list of names, or balances, or
credit card numbers. Lucene alone can only provide some form of data
protection, not security. If you accept this limitation you will find it
easier to accept the notion of encryption at field level, just like some
relational database software encrypts at column level. Just as importantly,
you want to be able to search over that encrypted field, something which my
proposed code provides (within the stated current limitation).
So exactly what pieces of data about a set of fields in an index need to
be encrypted before you can adequately say that those fields are encrypted?
Off the top of my head I don't know, but I think the only way to play it
safe is to assume that *all* of the data needs to be encrypted.
Cannot agree here, it's application dependent. And keep in mind that once
you offer new functionality people will find many original applications for
it.
Now the question becomes: do we modify all of the index writing/reading code
to add a lot of "if (encrypted) { ... } else { ... }" checks, or is there
an easier way to ensure that all of the data is encrypted without
impacting the majority of the user base?
A perfectly valid point; only benchmarking will tell by how much the current
performance of Lucene will be impacted by the addition of encryption.
Somebody in this discussion suggested a Lucene benchmarking tool which can
be used. I am not familiar with it, but if it is easy to run then let's do
it and settle this part of the discussion with facts.
On a more philosophical level, are you saying that there should not be any
added functionality to Lucene if it impacts the performance of those who do
not need the additional functionality? This could be a major limitation to
the future of Lucene. Perhaps one should set some small % limit on the
level of impact, but zero could be too limiting.
I would argue that creating an EncryptedDirectory class with an
API that
looks something like this.......
.............
.............
- Do my concerns about that impact make sense to you?
- Does my (high level) description of how I think encryption might make
  sense as an optional Lucene feature make sense?
- Are there any advantages you see to your approach that you feel make it
  more worthwhile than a Directory based approach?
Points one and two are perfectly valid and make a lot of sense. Point three
is about what is best for most users, given that there is already an OS
option to encrypt at directory level.
I like field encryption because it is functionality which cannot be
implemented at the OS level, and because of its granularity and its
similarity to existing Lucene functionality, it would be more intuitive and
easier to implement at the application level. Encrypting everything in a
directory would have a performance impact on the application.
I accept your point about the difference between a file system directory and
a Lucene directory. But in order to overcome the lack of field-level
encryption and to minimise the performance impact on the application you
would be forced to create a separate index and directory for each field
which you want encrypted. It will work, but it is not a solution I would like
to have to adopt at the application level.
Finally a point about my code. I was unsuccessful in creating a diff file
because I was picking up all kinds of formatting differences as well. If you
scan it quickly you will find that it is really very simple and, at least in
its current limited implementation, hardly invasive of Lucene's core. All
the encryption routines are in a separate class which I placed in the
utility package.
Victor
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
View this message in context: http://www.nabble.com/Attached-proposed-modifications-to-Lucene-2.0-to-support-Field.Store.Encrypted-tf2727614.html#a7708481
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.