Re: Index a source, but not store it... can it be done?

Doron Cohen Thu, 08 Mar 2007 08:58:47 -0800

Token positions are also used for phrase search.

You could probably work around this by setting all token positions to 0 -
the document would then appear to be a *set* of words (rather than a
*list*). An adversary would still be able to know/guess which words are in
each document (and, with (API) access to the index itself, how many times
each word appears in each document), but would not be able to reconstruct a
"good" approximation of that document, because all term positions are 0.
The price is that phrase searches on that field would stop being
meaningful. If this is sufficient, I think you can do it by writing your
own Analyzer with a TokenFilter that takes care of the position - see
Token.setPositionIncrement().
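
To make the trade-off concrete, here is a rough, self-contained sketch of the
idea. It deliberately does not use Lucene at all: PositionDemo, index(),
reconstruct() and matchesPhrase() are hypothetical names for a toy positional
index, and the whitespace tokenizer is an assumption. In a real setup the
zeroing would happen inside a TokenFilter via Token.setPositionIncrement(),
as mentioned above.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a positional inverted index. This is NOT Lucene's API;
// all class and method names here are hypothetical.
final class PositionDemo {

    // Build term -> list of recorded positions. When keepPositions is
    // false, every occurrence is recorded at position 0.
    static Map<String, List<Integer>> index(String text, boolean keepPositions) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        int position = 0;
        for (String term : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            position++;
            postings.computeIfAbsent(term, k -> new ArrayList<>())
                    .add(keepPositions ? position : 0);
        }
        return postings;
    }

    // What someone with index access could try: order all tokens by their
    // recorded position. The sort is stable, so ties keep the map's
    // iteration order (alphabetical for the TreeMap built by index()).
    static String reconstruct(Map<String, List<Integer>> postings) {
        List<Map.Entry<Integer, String>> tokens = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            for (Integer pos : e.getValue()) {
                tokens.add(new AbstractMap.SimpleEntry<>(pos, e.getKey()));
            }
        }
        tokens.sort(Map.Entry.comparingByKey());
        List<String> words = new ArrayList<>();
        for (Map.Entry<Integer, String> t : tokens) {
            words.add(t.getValue());
        }
        return String.join(" ", words);
    }

    // Exact two-word phrase match: 'second' must sit right after 'first'.
    static boolean matchesPhrase(Map<String, List<Integer>> postings,
                                 String first, String second) {
        List<Integer> a = postings.getOrDefault(first, Collections.emptyList());
        List<Integer> b = postings.getOrDefault(second, Collections.emptyList());
        for (int p : a) {
            if (b.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "LET'S SELL CAT AND DOG ROBOT TOYS THIS DECEMBER";
        Map<String, List<Integer>> full = index(text, true);
        Map<String, List<Integer>> flat = index(text, false);
        System.out.println(reconstruct(full));                 // original word order
        System.out.println(reconstruct(flat));                 // alphabetical bag of words
        System.out.println(matchesPhrase(full, "cat", "and")); // true
        System.out.println(matchesPhrase(flat, "cat", "and")); // false
    }
}
```

Note that in the zeroed index each term still carries one entry per
occurrence, so the word counts remain visible; only the ordering is lost,
and the phrase lookup stops matching along with it.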


Hope this helps,
Doron

"Walt Stoneburner" <[EMAIL PROTECTED]> wrote on 08/03/2007
07:28:59:

> Have an interesting scenario I'd like to get your take on with respect
> to Lucene:
>
> A data provider (e.g. someone with a private website or corporately
> shared directory of proprietary documents) has requested their content
> be indexed with Lucene so employees can be redirected to it, but
> provisionally -- under no circumstance should that content be stored
> or recreated from the index.
>
> Is that even possible?
>
> The data owner's request makes sense in the context of them wanting to
> retain full access control via logins as well as collecting access
> metrics.
>
> If the token 'CAT' points to C:\Corporate\animals.doc and the token
> 'DOG' also points there, then great, CAT AND DOG will give that
> document a higher rating, though it is not possible to reconstruct
> (with any great accuracy) what the actual document content is.
>
> However, if for the sake of using the NEAR operator with Lucene the
> tokens are stored as LET'S:1 SELL:2 CAT:3 AND:4 DOG:5 ROBOT:6 TOYS:7
> THIS:8 DECEMBER:9 ... then someone could pull all tokens for
> animals.doc and reconstitute the token stream.
>
> Does Lucene have any kind of trade off for working with "secure" (and
> I use this term loosely) data?
>
> -wls

