On Mar 9, 2005, at 10:09 AM, javier muguruza wrote:
(I sent this to the old list, but I don't know whether it reached the
list... just in case, I am reposting it.)

Hi all,

We index our documents in the following way:

        doc = new Document();
        // mailid
        doc.add(Field.UnIndexed("mid",mid));
        //body
        doc.add(Field.UnStored("body", textb));

mid is a unique identifier, and body contains long pieces of text to be indexed.

We later search on the body field; the mid lets us find a
file on the filesystem holding a compressed (and digitally signed)
version of the original indexed body.
A query in our app works like this:
1. First we run a search in a db (for many different reasons) that
returns a number (from 0 to thousands) of mids.
2. We use Lucene to search for some text in many indexes; this returns
a second list of mids.
3. We return the intersection of both lists as the result.
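
For illustration only, step 3 could look roughly like the sketch below. The class and method names are hypothetical and not taken from the application described here; the mids are assumed to be plain strings.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: intersects the mids from the db search (step 1)
// with the mids from the Lucene search (step 2).
final class MidIntersection {
    private MidIntersection() {}

    /** Returns the mids present in both lists, keeping the db order. */
    static Set<String> intersect(List<String> dbMids, List<String> luceneMids) {
        Set<String> result = new LinkedHashSet<>(dbMids);
        // Checking membership against a HashSet keeps each lookup cheap,
        // even when one list holds thousands of mids.
        result.retainAll(new HashSet<>(luceneMids));
        return result;
    }
}
```

Calling MidIntersection.intersect with the two lists gives the final result set; an empty list on either side yields an empty result.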

This is working fine right now, but we wonder whether we are using
Lucene to the fullest, because we could also store mid as a keyword
(instead of unindexed) and add the condition (AND mid==[any mid from
our step 1]) to the Lucene query we run. My questions are:

1. Is there a limit on the number of conditions I can add to a query?
Sometimes we have 10 mids, other times we have thousands of them, so we
would have to add: AND (mid:mid1 OR mid:mid2 ... OR mid:mid10000).
Probably there is a limit, and we could only apply the mid conditions
when the number of mids returned by step 1 is smaller than that limit?

BooleanQuery has a default limit of 1,024 clauses (it can be raised with BooleanQuery.setMaxClauseCount), so OR-ing mids into the query would only be practical when there is a small number of them. Consider using a Filter instead. There are some built-in ones, but a custom one may be best.
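
As a rough sketch of the "only when the number of mids is small" idea, the helper below builds the OR clause when the mid count fits under the limit and otherwise returns nothing, so the caller can fall back to a Filter or to intersecting the lists. It is plain Java with no Lucene calls; the class name, the method name, and the assumption that mids contain no query-syntax special characters are all mine.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper: builds "(mid:a OR mid:b ...)" for a query string.
final class MidClauseBuilder {
    // The clause limit quoted above; adjust if your setup uses another value.
    static final int MAX_CLAUSES = 1024;

    private MidClauseBuilder() {}

    /** Returns the OR clause, or empty when there are no mids or too many. */
    static Optional<String> orClause(List<String> mids) {
        if (mids.isEmpty() || mids.size() > MAX_CLAUSES) {
            return Optional.empty();
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < mids.size(); i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            // Assumes the mid needs no escaping for the query parser.
            sb.append("mid:").append(mids.get(i));
        }
        return Optional.of(sb.append(')').toString());
    }
}
```

When a clause comes back, it can be appended to the text query with AND; when it does not, the mid restriction has to be applied some other way.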


2. As the mid is a unique identifier (I guess Lucene does not care
about that, right?)

Right, Lucene doesn't care about field/term uniqueness.

, and the condition on the mid would be ANDed to the
text query conditions, will it be faster for Lucene to look first in
the mid field and skip the text lookup if the mid condition is not
fulfilled? I don't know whether I am clear enough... Will I get some
benefit on the queries by adding some additional conditions, or will the
cost of adding another field to the index not pay off? Maybe it
depends on the number of documents? Maybe it would be best to set mid
as a keyword just in case, and add it as a condition later if the
searches take too long?

I doubt you'd even notice the difference. There is little cost to adding the additional field, and it looks like you'd benefit from having mid as a Keyword.


Also, a custom Filter could bounce to your relational database to constrain results to a set of mids. Filters are designed to be reused across multiple queries and cached; keep that in mind and maybe it'll work out well in your scenario.
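
To make the idea concrete, here is a rough model of what such a filter computes, written in plain Java rather than against the Lucene API. A Filter boils down to a bit set with one bit per document number; the sketch sets the bit for every document whose mid is in the allowed set and then ANDs that with the text matches. The class name, the method names, and the midByDocNumber list (a stand-in for whatever lookup maps a document number to its mid) are assumptions for illustration.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a mid-based filter; not real Lucene code.
final class MidFilterSketch {
    private MidFilterSketch() {}

    /** Sets bit i when the document numbered i has a mid in allowedMids. */
    static BitSet allowedDocs(List<String> midByDocNumber, Set<String> allowedMids) {
        BitSet bits = new BitSet(midByDocNumber.size());
        for (int doc = 0; doc < midByDocNumber.size(); doc++) {
            if (allowedMids.contains(midByDocNumber.get(doc))) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Keeps only the text-query matches that the filter also allows. */
    static BitSet applyFilter(BitSet textMatches, BitSet allowed) {
        // Work on a copy so the caller's match set is left untouched.
        BitSet result = (BitSet) textMatches.clone();
        result.and(allowed);
        return result;
    }
}
```

Because the allowed bit set depends only on the mids and the index, it can be computed once and reused for several text queries, which is where the caching mentioned above pays off.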

        Erik



Thanks for any thoughts on that.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

