Lucene search results not stable
Hi experts,

Our application pulls data out of a DB and writes it as Lucene documents every 5 minutes. We use a loop to keep updating the documents, but we only use one writing thread and always close the IndexWriter once we finish writing. Unfortunately, we keep getting IOException and FileNotFoundException (I guess there is some problem with the lock time setting inside Lucene, though I am not sure). Our IndexSearcher doesn't do deletions, so it should be thread-safe. The problem is that sometimes we get the results we expect (e.g. 10 hits) and sometimes we get nothing.

I traced through our Lucene code with other engineers. It looks straightforward and correct, but the exceptions keep being thrown, and I think the unstable search results are related to these exceptions.

What might be the problem, and how can we solve it? Any suggestion or idea will be appreciated.

Thanks.

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Exception thrown from IndexFiles..help!
> Hi guys!
>
> I just downloaded the Lucene software and started the simple demo
> tutorial. I tried to run IndexFiles.java in a NetBeans environment on a
> Windows XP platform.
>
> I replaced args[0] with the path "C:\\lucene-1.2\\src", but I get the
> following exception:
>
> >> caught a java.lang.ArrayIndexOutOfBoundsException
> >> with message : null
>
> Please help me get started!
>
> I would also like to know how to set the classpath on a Windows XP platform.
>
> Thanks for your help,
>
> Othman
Re: difference in javadoc and faq similarity expression
Nicolas Maisonneuve wrote:
> In the Similarity Javadoc:
>
>   score(q,d) = Sum [ tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q,d) * queryNorm(q) ]
>
> In the FAQ:
>
>   score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
>
>   In FAQ         | In Javadoc
>   1 / norm_q     = queryNorm(q)
>   1 / norm_d_t   = lengthNorm(t.field in d)
>   coord_q_d      = coord(q,d)
>   boost_t        = getBoost(t.field in d)
>   idf_t          = idf(t)
>   tf_d           = tf(t in d)
>
> But where is the Javadoc expression for the FAQ's "tf_q"?

I think tf_q is always 1.0. If a term occurs twice in the query then Lucene considers them as two terms with tf_q = 1.0 rather than a single term with tf_q = 2.0.

Doug
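For reference, substituting Doug's answer (tf_q = 1) into the FAQ expression quoted in this message gives the form below, written in LaTeX with the FAQ's own symbols. This is only that substitution, not a statement about how Lucene computes the score internally.

```latex
\mathrm{score}_d \;=\; \mathrm{coord}_{q,d} \cdot \sum_{t}
  \frac{\mathit{idf}_t}{\mathit{norm}_q} \cdot
  \frac{\mathit{tf}_d \cdot \mathit{idf}_t}{\mathit{norm}_{d,t}} \cdot
  \mathit{boost}_t
```

Following Doug's description, a term that occurs twice in the query would simply appear twice in the sum over t.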
Re: setMaxClauseCount ??
setMaxClauseCount determines the maximum number of clauses, which is not your problem here. Your problem is with required clauses. There may only be a total of 31 required (or prohibited) clauses in a single BooleanQuery. If you need more, then create more BooleanQueries and combine them with another BooleanQuery.

Perhaps this could be done automatically, but I've never heard of anyone encountering this limit before. Do you really mean for 32 different terms to be required? Do any documents actually match this query?

Doug

Karl Koch wrote:
> Hi group,
>
> I ran into an IndexOutOfBoundsException:
>
>   java.lang.IndexOutOfBoundsException: More than 32 required/prohibited clauses in query.
>
> The reason: I have more than 32 BooleanClauses. From the mailing list I got the info on how to set the maximum number of clauses higher before a loop:
>
>   ...
>   myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
>   while (true) {
>       Token token = tokenStream.next();
>       if (token == null) {
>           break;
>       }
>       myBooleanQuery.add(new TermQuery(new Term("bla", token.termText())), true, false);
>   }
>   ...
>
> However, the error still remains. Why?
>
> Karl
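Doug's suggestion in this thread (several smaller BooleanQueries combined by another BooleanQuery) comes down to grouping the required terms first. The following is a minimal plain-Java sketch of that grouping step only. The class and method names are hypothetical and not part of Lucene, and the limit of 31 is taken from Doug's reply.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits required terms into groups
// small enough to stay under the per-query limit described in the reply.
class RequiredClauseGroups {
    // Limit quoted in the reply: 31 required/prohibited clauses per BooleanQuery.
    static final int MAX_REQUIRED_PER_QUERY = 31;

    // Splits terms into consecutive groups of at most maxPerGroup entries.
    static List<List<String>> split(List<String> terms, int maxPerGroup) {
        if (maxPerGroup < 1) {
            throw new IllegalArgumentException("maxPerGroup must be >= 1");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < terms.size(); start += maxPerGroup) {
            int end = Math.min(start + maxPerGroup, terms.size());
            groups.add(new ArrayList<>(terms.subList(start, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 70; i++) {
            terms.add("term" + i);
        }
        // Each group would become one inner BooleanQuery of required clauses;
        // an outer BooleanQuery would then require every inner query.
        for (List<String> group : split(terms, MAX_REQUIRED_PER_QUERY)) {
            System.out.println(group.size()); // prints 31, 31, 8
        }
    }
}
```

Since the outer query is itself a BooleanQuery, it would presumably be subject to the same limit on required clauses, so very long term lists might need more than one level of nesting.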
RE: mergeFactor and maxMergeDocs
My job is to measure and benchmark for capacity planning purposes. That means knowing how much room I have to work with on the tuning knobs.

Herb...

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 20, 2004 12:18 PM
To: Lucene Users List
Subject: Re: mergeFactor and maxMergeDocs

Obsession with indexing performance is not healthy. Before changing any settings convince yourself that indexing performance is a real problem for your application. How often do you re-index from scratch? Are you really having any difficulty keeping up with the rate of change of your collection? Perhaps your development time would be better spent focussing on other parts of your application.
Re: Getting all index fields of an index
Try calling IndexReader.getFieldNames().

Karl Koch wrote:
> How can I get a list of all fields in an index from which I know only the directory string?
>
> Karl
Re: Ordering documents
Yes, this is correct.

Peter Keegan wrote:
> So they are sorted by reverse document number. Is this the 'external' document number (the one that is adjusted for the segment's base)? If so, then this means that documents with equal score are returned in the order in which they were added to the index. Is this correct?
>
> Thanks,
> Peter
>
> ----- Original Message -----
> From: "Morus Walter" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Saturday, January 17, 2004 4:57 PM
> Subject: Re: Ordering documents
>
> > Peter Keegan writes:
> > > What is the returned order for documents with identical scores?
> >
> > Have a look at the source of the lessThan method in org.apache.lucene.search.HitQueue:
> >
> >   protected final boolean lessThan(Object a, Object b) {
> >     ScoreDoc hitA = (ScoreDoc)a;
> >     ScoreDoc hitB = (ScoreDoc)b;
> >     if (hitA.score == hitB.score)
> >       return hitA.doc > hitB.doc;
> >     else
> >       return hitA.score < hitB.score;
> >   }
> >
> > Sorting is done by this method.
> >
> > HTH
> > Morus
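The effect of the quoted lessThan rule on ties can be checked without Lucene. The sketch below is a stand-alone illustration: Hit is a hypothetical stand-in for ScoreDoc, and rankedDocs is an assumed helper that lists hits best-first using the same comparison, not Lucene's actual priority-queue code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TieOrderSketch {
    // Hypothetical stand-in for Lucene's ScoreDoc: a document number and a score.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Same rule as the quoted lessThan: on equal scores, the hit with the
    // higher document number counts as "less", so it ranks lower.
    static boolean lessThan(Hit a, Hit b) {
        if (a.score == b.score) {
            return a.doc > b.doc;
        }
        return a.score < b.score;
    }

    // Returns document numbers best-first, i.e. "greater" hits come first.
    static List<Integer> rankedDocs(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> lessThan(a, b) ? 1 : (lessThan(b, a) ? -1 : 0));
        List<Integer> docs = new ArrayList<>();
        for (Hit h : sorted) {
            docs.add(h.doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(new Hit(5, 1.0f), new Hit(2, 1.0f), new Hit(9, 2.0f));
        System.out.println(rankedDocs(hits)); // prints [9, 2, 5]
    }
}
```

Documents 2 and 5 have the same score, and the lower document number comes first, which matches the conclusion in this thread that equal-scoring documents come back in the order they were added.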
Re: IndexReader.document(int i)
Nicolas Maisonneuve wrote:
> I would like to know, in IndexReader.document(int i), what is this number i? Is the first document the oldest document indexed and the last the youngest (so we can easily sort by date)?

Yes, documents with lower numbers were indexed earlier. As documents are deleted the numbers of other, higher documents shift downwards, but the order of document numbers always represents the order that documents were added to the index.

Doug
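The renumbering Doug describes can be pictured with a plain list, where a document's number is its position. This is only a toy model of the behaviour described in the reply, with hypothetical names, and it says nothing about when or how Lucene actually reassigns numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DocNumberSketch {
    // Returns the surviving documents in index order; a document's new number
    // is its position in the returned list.
    static List<String> afterDelete(List<String> docs, int deletedDocNumber) {
        List<String> remaining = new ArrayList<>(docs);
        remaining.remove(deletedDocNumber); // int argument: removes by position
        return remaining;
    }

    public static void main(String[] args) {
        // Added in this order, so numbered 0..3 from oldest to youngest.
        List<String> docs = Arrays.asList("oldest", "second", "third", "youngest");
        List<String> remaining = afterDelete(docs, 1);
        // "third" moves from number 2 to number 1, but the relative order of
        // the surviving documents is unchanged.
        System.out.println(remaining); // prints [oldest, third, youngest]
    }
}
```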
Re: mergeFactor and maxMergeDocs
Chong, Herb wrote:
> What effects and recommendations are valid for Lucene 1.3?

Same as always: use the defaults and call optimize() only when you know you won't be changing the index for a while.

If you have lots of RAM, increasing minMergeDocs may increase indexing speed, but raising it too high may cause out-of-memory problems. You may also see some indexing speedup by increasing the mergeFactor, but raising it too high will cause file handle problems. Calling setUseCompoundFile() will enable higher mergeFactor settings before encountering file handle problems.

Obsession with indexing performance is not healthy. Before changing any settings convince yourself that indexing performance is a real problem for your application. How often do you re-index from scratch? Are you really having any difficulty keeping up with the rate of change of your collection? Perhaps your development time would be better spent focussing on other parts of your application.

Doug
Query Term Questions
1) Is there a way to set the query boost factor depending not on the presence of a term, but on the presence of two specific terms? For example, I may want to boost the relevance of a document that contains both "iraq" and "clerics", but not boost the relevance of documents that contain only one or the other term. (The idea is better discrimination than if I simply boosted both terms.)

2) Is it possible to apply (or simulate) a negative query boost factor? For example, I may have a complex query with lots of terms but want to reduce the relevance of a matching document that also includes the term "iowa". (The idea is an easier and more discriminating way than simply increasing the relevance of all other terms besides "iowa".)

3) Is there a way to handle variants of a phrase without OR'ing together the variants? For example, I may want to find documents dealing with North Korea; the terms might be "north korea" or "north korean" or "north koreans". Is there a way to handle this with a single term using wildcards?

Regards,
Terry
QueryParser and stopwords
Hi,

I'm currently trying to get rid of query parser problems with stopwords. Depending on the query, there are ArrayIndexOutOfBoundsExceptions, e.g. for

  stop AND nonstop

where stop is a stopword and nonstop is not.

While this isn't hard to fix (I'll enter a bug and a patch in Bugzilla), there's one issue left that I'm not sure how to deal with: what should the query parser return for a query string containing only stopwords?

And when I think about this, there's another one:

  stop AND NOT nonstop

creates a boolean query containing only prohibited terms, which AFAIK cannot be used in a search. How should this be dealt with?

Currently it returns an empty BooleanQuery. I think it would be more useful to return null in this case.

Morus