On Mar 9, 2005, at 10:09 AM, javier muguruza wrote:
(I sent this to the old list; I don't know whether it reached the list, so I am reposting it just in case.)
Hi all,
We index our documents in the following way:
    doc = new Document();
    // mailid
    doc.add(Field.UnIndexed("mid", mid));
    // body
    doc.add(Field.UnStored("body", textb));
mid is a unique identifier, and body contains long pieces of text to be indexed.
Later we search on the body field; the mid lets us find a file on the filesystem with a compressed (and digitally signed) version of the original body that was indexed. A query in our app works like this:
1. First we run a search in a db (for many different reasons) that returns a number (from 0 to thousands) of mids.
2. We use Lucene to search for some text in many indexes; this returns a second list of mids.
3. We return the intersection of both lists as the result.
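The intersection in step 3 can be sketched in plain Java as follows. This is only an illustration, not the poster's actual code: the class and method names are hypothetical, mids are assumed to be strings, and the result is assumed to follow the order in which Lucene returned its hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: intersects the mids from the db with the mids from Lucene.
class MidIntersection {

    // Returns the mids present in both lists, in the order of luceneMids,
    // with duplicates removed.
    static List<String> intersect(List<String> dbMids, List<String> luceneMids) {
        Set<String> allowed = new HashSet<>(dbMids);   // fast membership checks
        Set<String> result = new LinkedHashSet<>();    // keeps order, drops duplicates
        for (String mid : luceneMids) {
            if (allowed.contains(mid)) {
                result.add(mid);
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<String> fromDb = Arrays.asList("m1", "m2", "m3");
        List<String> fromLucene = Arrays.asList("m3", "m9", "m1");
        System.out.println(intersect(fromDb, fromLucene)); // prints [m3, m1]
    }
}
```

Putting the db list into a hash set keeps the cost roughly linear in the size of both lists, which matters when step 1 returns thousands of mids.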
This is working fine right now, but I wonder whether we are not using Lucene to the fullest, because we could also store mid as a Keyword (instead of UnIndexed) and add the condition (AND mid==[any mid from our step 1]) to the Lucene query we run. My questions are:
1. Is there a limit on the number of conditions I can add to a query? Sometimes we have 10 mids, other times we have thousands of them, so we would have to add: AND (mid:mid1 OR mid:mid2 ... OR mid:mid10000). Probably there is a limit, and we could apply the mid conditions only when the number of mids returned by step 1 is smaller than that limit?
BooleanQuery has a built-in limit of 1,024 clauses so it would only be useful when there is a small number of mids. Consider using a Filter though. There are some built-in ones, but maybe a custom one is best.
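For the small-list case, a guard along these lines could decide whether to add the mid conditions at all. It is a sketch in plain Java with no Lucene classes: the class name, method name, and the choice to return an empty Optional as the "too many, use something else" signal are all assumptions, and it assumes mids contain no characters that are special in the query syntax. The 1,024 figure is the one quoted above; how clauses are counted across nested queries may vary by Lucene version.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: builds the "(mid:a OR mid:b ...)" group only when it fits.
class MidClauseBuilder {

    // Clause limit quoted in the reply above.
    static final int MAX_CLAUSES = 1024;

    // Returns the OR group for the given mids, or Optional.empty() when the
    // list is empty or has more entries than MAX_CLAUSES.
    static Optional<String> buildMidClause(List<String> mids) {
        if (mids.isEmpty() || mids.size() > MAX_CLAUSES) {
            return Optional.empty();
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < mids.size(); i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            sb.append("mid:").append(mids.get(i));
        }
        return Optional.of(sb.append(")").toString());
    }

    public static void main(String[] args) {
        // prints (mid:mid1 OR mid:mid2)
        System.out.println(buildMidClause(Arrays.asList("mid1", "mid2")).orElse("<no clause>"));
    }
}
```

When the result is empty, the caller would fall back to the existing intersection approach or to a Filter.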
2. As the mid is a unique identifier (I guess Lucene does not care about that, right?)
Right, Lucene doesn't care about field/term uniqueness.
, and the condition on the mid would be ANDed to the text query conditions, will it be faster for Lucene to look first in the mid field and skip the text lookup if the mid condition is not fulfilled? I don't know whether I am clear enough... Will I get some benefit on the queries by adding some additional conditions, or will the cost of adding another field to the index not pay off? Maybe it depends on the number of documents? Maybe it would be best to set mid as a Keyword just in case, and add it as a condition later if the searches take too long?
I doubt you'd even notice the difference. There is little cost to adding the additional field, and it looks like you'd benefit from having mid as a Keyword.
Also, with a Filter, you could use it to bounce to your relational database to constrain results based on a set of mids. Filters are designed to be used for multiple queries and cached - keep that in mind and maybe it'll work out well in your scenario.
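To make the Filter idea concrete: as far as I recall, a Filter in Lucene releases of this period reports its result as a java.util.BitSet with one bit per document number. The sketch below imitates that idea in plain Java without any Lucene classes. The array of per-document mids stands in for reading the mid field from the index, and every class and method name is hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a custom Filter: one bit per document number.
class MidBitSetFilter {

    // docMids[i] is the mid of document number i (assumed to come from the index).
    // Sets a bit for every document whose mid is in allowedMids.
    static BitSet allowedDocs(String[] docMids, Set<String> allowedMids) {
        BitSet bits = new BitSet(docMids.length);
        for (int doc = 0; doc < docMids.length; doc++) {
            if (allowedMids.contains(docMids[doc])) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Keeps only the query hits that the filter allows; inputs are not modified.
    static BitSet applyFilter(BitSet queryHits, BitSet filterBits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterBits);
        return result;
    }

    public static void main(String[] args) {
        String[] docMids = {"m1", "m2", "m3", "m4"};
        Set<String> fromDb = new HashSet<>(Arrays.asList("m2", "m4"));
        BitSet filterBits = allowedDocs(docMids, fromDb); // documents 1 and 3

        BitSet queryHits = new BitSet();
        queryHits.set(0, 3); // the text query matched documents 0, 1, 2

        System.out.println(applyFilter(queryHits, filterBits)); // prints {1}
    }
}
```

Because the bit set depends only on the set of mids and not on the text query, it can be computed once and reused across several searches, which is the caching benefit mentioned above.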
Erik
Thanks for any thoughts on that.
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]