You could store a value for each flag and then be careful about which analyzers
you use. For instance, use WhitespaceAnalyzer (at both index AND search time)
and do your own casing. That is, make sure you lowercase as necessary when you
index and when you query (NOTE: the operators AND, OR and NOT must not be
lowercased if you send them through QueryParser).
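
To make the casing point concrete, here is a minimal plain-Java sketch. It makes
no Lucene calls, and the class and method names (FlagCasing, forIndex, forQuery)
are made up for illustration. It lowercases every term but leaves the upper-case
AND, OR and NOT operators alone, which is roughly the do-it-yourself casing you
would need in front of WhitespaceAnalyzer:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene.
final class FlagCasing {
    // Boolean operators that must stay upper case in the query string.
    private static final List<String> OPERATORS = Arrays.asList("AND", "OR", "NOT");

    // Lowercase the text before it is indexed into the "flag" field.
    static String forIndex(String flags) {
        return flags.toLowerCase(Locale.ROOT);
    }

    // Lowercase every whitespace-separated term except the boolean operators.
    static String forQuery(String query) {
        StringJoiner out = new StringJoiner(" ");
        for (String term : query.trim().split("\\s+")) {
            out.add(OPERATORS.contains(term) ? term : term.toLowerCase(Locale.ROOT));
        }
        return out.toString();
    }
}
```

The same normalization has to run on both sides, otherwise "Joke=Y" at index time
and "joke=y" at query time are simply different tokens.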

Doc field for these is "flag"
Doc 1 has tokens indexed "joke=y", "adult=n" (input stream is "joke=y
adult=n")
Doc 2 has tokens indexed "joke=y", "adult=y"
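
If the flags start life as a bit set in the DB, the input stream for the "flag"
field could be built along these lines. This is only a sketch: the bit values
(joke = 1, adult = 2) are taken from the example in the quoted message below, and
FlagField and fromBits are hypothetical names, not Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that turns DB status bits into flag tokens.
final class FlagField {
    // Bit value -> flag name. Assumed values: joke = 1, adult = 2.
    private static final Map<Integer, String> FLAGS = new LinkedHashMap<>();
    static {
        FLAGS.put(1, "joke");
        FLAGS.put(2, "adult");
    }

    // Emit one "name=y" or "name=n" token per known flag, separated by spaces.
    static String fromBits(int status) {
        StringJoiner out = new StringJoiner(" ");
        for (Map.Entry<Integer, String> flag : FLAGS.entrySet()) {
            boolean set = (status & flag.getKey()) != 0;
            out.add(flag.getValue() + (set ? "=y" : "=n"));
        }
        return out.toString();
    }
}
```

With these assumptions, a status of 1 gives the input stream of Doc 1 and a
status of 3 gives that of Doc 2.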

Now your query for site 1 looks like "joke=y" AND "adult=y" (looking at the
flag field), and for site 2 it is "joke=y" AND "adult=n" (ditto).

Non jokes would just be "joke=n"
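
The matching behaviour of those queries can be mimicked without Lucene. The
sketch below is a stand-in for the index, not the real thing, and FlagMatch is a
hypothetical name: each document is just the set of its whitespace-separated
tokens, and an AND query is "contains every required token".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for an index lookup; no Lucene involved.
final class FlagMatch {
    // Split on whitespace only, so "joke=y" stays a single token.
    static Set<String> tokens(String flagField) {
        return new HashSet<>(Arrays.asList(flagField.trim().split("\\s+")));
    }

    // True if the document contains every required token (an AND query).
    static boolean matchesAll(String flagField, String... required) {
        return tokens(flagField).containsAll(Arrays.asList(required));
    }
}
```

For example, matchesAll("joke=y adult=n", "joke=y", "adult=n") holds for Doc 1,
while the same call with "adult=y" does not.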

etc.....

Note especially that if you used, say, StandardAnalyzer, you'd get tokens
"joke", "y", "n", "adult" in your flag field for doc1 and "joke", "y",
"adult" for doc 2, which is not what you want at all.

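The difference can be seen with two plain string splits. This is only a rough
approximation: the second regex stands in for an analyzer that breaks on
punctuation and is NOT StandardAnalyzer's actual grammar, and TokenizerContrast
is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Rough illustration only; neither method calls a Lucene analyzer.
final class TokenizerContrast {
    // Whitespace-only splitting keeps "joke=y" together.
    static List<String> whitespaceSplit(String text) {
        return nonEmpty(text.split("\\s+"));
    }

    // Splitting on anything that is not a letter or digit separates
    // the flag name from its value.
    static List<String> punctuationSplit(String text) {
        return nonEmpty(text.split("[^A-Za-z0-9]+"));
    }

    // Drop the empty strings that split() can produce at the start.
    private static List<String> nonEmpty(String[] parts) {
        List<String> out = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }
}
```

Once "adult" and "n" are separate tokens, nothing in the index ties the value back
to its flag, so a search for "y" can no longer tell the two flags apart.
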
This will certainly increase your index size a bit. Do you have any idea how big
your indexes are going to be? If storing each field makes the index grow
from, say, 500M to 550M, you won't care. If you're storing a bazillion
documents and it'll bloat your index from 10G to 20G, you probably will. One
pertinent question is "how many fields do you expect to store per
document?"...

Best
Erick

On 11/28/06, Biggy <[EMAIL PROTECTED]> wrote:


The background of this is also separating content according to domains

Example:
- pictureA (marked as a "joke", flag: 1)
- pictureB (marked as an "adult picture", flag: 2)
Site1: Users allowed to view everything (pictureA, pictureB )
Site2: Users allowed to view everything except pictureB (no adult content)

This scenario, for instance, means a query from each site via SQL could be
Site1: ... status & 3 ; // all pictures (joke, adult)
Site2: ... not (status & 2) ; // no adult stuff (adult is flag 2)
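
The same two checks, written as a small Java sketch for clarity. It assumes the
flag values from the example (joke = 1, adult = 2); SiteFilter and its methods are
hypothetical names, not existing code.

```java
// Hypothetical illustration of the per-site bit tests.
final class SiteFilter {
    static final int JOKE = 1;   // assumed bit for "joke"
    static final int ADULT = 2;  // assumed bit for "adult picture"

    // Site1: any picture with the joke or adult bit set (status & 3).
    static boolean visibleOnSite1(int status) {
        return (status & (JOKE | ADULT)) != 0;
    }

    // Site2: everything except pictures with the adult bit set.
    static boolean visibleOnSite2(int status) {
        return (status & ADULT) == 0;
    }
}
```

Note that the Site2 rule is a pure negation, which is what makes it awkward to
carry over to a search index.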

PROBLEMS
Because the business rules are a negation (everything except this and that),
we have a problem adding new content types. Adding a new picture type means
changing the whole set of picture flags with the new status.

So backward compatibility is not possible.

That's why I thought that, with Lucene, I could search using "NOT".
That is: give me all non-adult pictures in the case of Site2.

Any suggestions for overcoming this flag problem without changing the DB
status and re-indexing everything whenever a new picture type is added?

Thanks for the good advice thus far.



Erick Erickson wrote:
>
> Lucene will automatically separate tokens during index and search if you
> use the right analyzer. See the various classes that implement Analyzer. I
> don't know if you really wanted to use the numeric literals, but I wouldn't.
> The analyzers that do the most for you (automatically break up on spaces,
> lowercase, etc.) often ignore numbers. Just in case you were thinking about
> doing it that way....
>
> I would NOT store the inverse and then use NOT. The NOT operator doesn't
> behave as you expect; it's not a strict boolean operator. See the thread
> titled *Another problem with the QueryParser* in this list, and anything
> else Chris or Yonik or ... has to say on the subject. This is a source of
> considerable confusion. For instance, you can't query on just the phrase
> "NOT no_music". Not to mention what happens if/when a user can actually
> use NOT in the query.
>
> In general, I *strongly* recommend doing it the simple, intuitive way
> first. Only get fancy if you actually have something to gain. Here, you're
> talking about some storage savings. Maybe (have you checked how big your
> index will be? Will this approach be enough to matter? How do you know?).
> You're creating code that will confuse not only yourself but whoever has to
> get into this code later.
>
> By rushing in and doing an optimization (which you neither *know* nor can
> reasonably expect to gain you anything measurable, since you don't know the
> internals of Lucene well enough to predict. Neither do I, BTW...) you're
> creating complexity which you don't know is necessary. I'd only go there
> if doing it the straightforward way shows performance issues. I'd also bet
> that any performance issues you see are not related to this issue......
>
> Best
> Erick
>
> On 11/28/06, Biggy <[EMAIL PROTECTED]> wrote:
>>
>>
>>
>> OK, here's what I've come up with after reading your suggestions:
>> - the bit set from the DB stays untouched
>> - only one field, "interest", shall be used to store the interest bits in
>>   the document. This saves disk space.
>> - the bits shall not be converted to readable strings but added as values
>>   separated by a space " "
>> ====Code Below====
>> -----------------
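>> // needs org.apache.lucene.document.Document and
>> // org.apache.lucene.document.Field
>> public Document getDocument(int db_interest_bits)
>> {
>>    String interest_string = "";
>>    // sport; Java needs an explicit != 0, an int is not a boolean
>>    if ((db_interest_bits & 1) != 0) {
>>        interest_string += "1" + " "; // empty space as delimiter
>>    }
>>    // music
>>    if ((db_interest_bits & 2) != 0) {
>>        interest_string += "2" + " "; // empty space as delimiter
>>    }
>>
>>    Document doc = new Document();
>>    // Document.add() takes a Field (Lucene 2.x constructor shown)
>>    doc.add(new Field("interest", interest_string,
>>                      Field.Store.NO, Field.Index.TOKENIZED));
>>    // how do i tell Lucene to separate tokens on search ?
>>
>>    return doc;
>> }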
>> ---------------
>>
>> FURTHERMORE - i realized that almost all potential values are often set
>> i.e.
>> sport music film
>> sport music
>> sport music film
>> sport music film
>> sport music
>> music
>>
>> So I was thinking: how about doing the reverse when it comes to building
>> the index?
>> I would only store the fields that are not set.
>> The search would be a negation.
>>
>> Example values of interest:
>> 1. "no_film" => only film is not set
>> 2. "no_sport no_film" => film and sport are not set
>> 3. "" => all values are set, since this is a negation
>>
>>
>> It follows that searching for people interested in music becomes:
>> => search for NOT no_music
>>
>> QUESTION
>> How does the performance of a negative search (NOT) compare to a normal
>> one, i.e. "NOT no_music" vs. a "music" search, under the premise that most
>> interest flags are set?
>>
>>
>>
>> ---------
>>
>> Daniel Noll-3 wrote:
>> >
>> > Erick Erickson wrote:
>> >> Well, you really have the code already <G>. From the top...
>> >>
>> >> 1> There's no good way to support searching bitfields. If you wanted,
>> >> you could probably store it as a small integer and then search on it,
>> >> but that's waaay more complicated than you want.
>> >>
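>> >> 2> Add the fields like you have in the snippet, something like
>> >> Document doc = new Document();
>> >> // Lucene 2.x Field constructor; an int test needs an explicit != 0
>> >> if ((bitsfromdb & 1) != 0) {
>> >>    doc.add(new Field("sport", "y", Field.Store.NO, Field.Index.UN_TOKENIZED));
>> >> }
>> >> if ((bitsfromdb & 2) != 0) {
>> >>    doc.add(new Field("music", "y", Field.Store.NO, Field.Index.UN_TOKENIZED));
>> >> }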
>> >
>> > Beware that if there are a large number of bits, this is going to
>> > impact memory usage due to there being more fields.
>> >
>> > Perhaps a better way would be to use a single "bits" field and store
>> > the words "sport", "music", ... in that field.
>> >
>> > Daniel
>> >
>> >
>> > --
>> > Daniel Noll
>> >
>> > Nuix Pty Ltd
>> > Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
>> > Web: http://nuix.com/                               Fax: +61 2 9212 6902
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> > For additional commands, e-mail: [EMAIL PROTECTED]
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7576286
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
>>
>
>

--
View this message in context:
http://www.nabble.com/Searching-by-bit-masks-tf2603918.html#a7581771
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

