Thank you Erick! I got Luke and it's a great tool! Using Luke, I verified that the queries I posted originally behaved as expected (i.e., "Canadian pepsi" produced fewer results than "pepsi" alone).
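For what it's worth, the same comparison can also be scripted against the index directly. Below is a minimal sketch of the kind of check I mean, assuming a Lucene 2.x-style API; the index path, the default field, the analyzer, and writing the "Canadian" restriction as a host:ca clause are all guesses on my part.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class HitCountCheck {
        public static void main(String[] args) throws Exception {
            // Index location, default field, and analyzer are assumptions.
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            QueryParser parser = new QueryParser("content", new StandardAnalyzer());

            Query broad  = parser.parse("pepsi");
            Query narrow = parser.parse("+pepsi +host:ca");

            Hits broadHits  = searcher.search(broad);
            Hits narrowHits = searcher.search(narrow);

            // The narrower query should never match more documents than the broader one.
            System.out.println("pepsi           : " + broadHits.length());
            System.out.println("pepsi + host:ca : " + narrowHits.length());

            searcher.close();
        }
    }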
Based on your suggestion, I found out that the program rewrote the query before it was sent to Nutch as the following:

(content:"+(content:pepsi) +((host:ca^10.0)^10.0)")

I think this is messing up the query results. Literally, this is what the code was doing:

    Query nutchQuery = new Query();
    nutchQuery.addRequiredTerm("+(content:pepsi) +((host:ca^10.0)^10.0)", "content");
    nutchBean.search(nutchQuery, maxHits, maxHitsPerSite);

The above code must be using a default analyzer, which I have not tracked down yet. But after looking at the Luke results and this code, I think my problem is in the query construction. Could you shed some light on how the new query should be constructed from the following string?

+(content:pepsi) +((host:ca^10.0)^10.0)

Thanks again!
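To make the question concrete, here is roughly what I expect that string to mean, written directly against the plain Lucene API rather than the Nutch Query wrapper. This is only a sketch of my understanding: the field names and the boost of 10 are taken from the string above, and I am assuming a Lucene 2.x-style BooleanQuery/TermQuery API.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class QuerySketch {
        public static void main(String[] args) {
            // +(content:pepsi) : a required term in the content field
            TermQuery content = new TermQuery(new Term("content", "pepsi"));

            // +((host:ca^10.0)^10.0) : a required term in the host field, boosted
            TermQuery host = new TermQuery(new Term("host", "ca"));
            host.setBoost(10.0f);

            BooleanQuery query = new BooleanQuery();
            query.add(content, BooleanClause.Occur.MUST); // must contain pepsi
            query.add(host, BooleanClause.Occur.MUST);    // must also be a ca host

            // Should print something like: +content:pepsi +host:ca^10.0
            System.out.println(query.toString());
        }
    }

By contrast, if I am reading my code above correctly, it passes the entire pre-rendered string "+(content:pepsi) +((host:ca^10.0)^10.0)" as a single required term in the content field, so whatever analyzer Nutch applies ends up tokenizing the query syntax itself. I would have expected something closer to adding the terms separately, e.g. addRequiredTerm("pepsi", "content") and addRequiredTerm("ca", "host"), but please correct me if that is not how the Nutch Query class is meant to be used.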
Erick Erickson wrote:
>
> For Luke, see http://www.getopt.org/luke/ or just google lucene luke
>
> About the same analyzers at index and search. No, it's not *required*
> that you use the same analyzer, but unless and until you understand
> what analyzers do, I would *strongly* recommend that you use the
> same one. See PerFieldAnalyzerWrapper for using different analyzers
> for different fields....
>
> Conceptually, think of an analyzer as giving you the "smallest meaningful
> token" from your stream that your program then does something with,
> either index or search.
>
> Consider the line "this is the 2nd Line in Erick's program". What are the
> tokens? Well, depending on the analyzer you could get
> nd, line, erick, program (assuming your analyzer lowercases, strips
> numbers and punctuation and removes stopwords)
>
> You could get
> "this is the 2nd Line in Erick's program" (assuming your analyzer did
> nothing, just returned the entire input stream as a token, which would
> then NOT get a hit searching, say, 'Line')
>
> You could get
> this, is, the, 2nd, line, in, erick's, program (assuming your analyzer just
> lowercased and split on whitespace)
>
> You could get
> nonsense, nonsense, nonsense, nonsense (assuming you wrote a pathological
> analyzer that always returned the word "nonsense")
>
> Really, any transformations you wanted to perform can be done in an
> analyzer. So using different analyzers during indexing and searching will
> very often produce "surprising" results. Don't do it unless you know
> exactly what's being done to your input streams. Or be prepared to spend
> a lot of hours tracking down "what went wrong" <G>.
>
> All that said, it doesn't explain the behaviour you're seeing because
> you're right, adding successively more restrictions should produce smaller
> result counts. I suspect that if you see what actual queries you're
> generating (and when you get Luke, be sure to find the drop-down for
> "which analyzer" to use) you'll find something surprising.
>
> Best
> Erick
>
> On Fri, Sep 26, 2008 at 6:57 PM, student_t <[EMAIL PROTECTED]> wrote:
>
>> Hi Eric,
>>
>> Thanks a bunch for your pointers. I will need to find out the analyzers
>> at index and query time. But is it critical to have the same analyzers
>> during these two times?
>>
>> I had tested with lucli on some of my local segment data and it appeared
>> to work fine (i.e., the result sets are reasonable.)
>>
>> Is Luke part of Lucene contrib? I recall there is a GUI that lets you
>> view the indices. Would you please elaborate?
>>
>> Thanks again!
>> student_t
>>
>> Erick Erickson wrote:
>> >
>> > That certainly doesn't look right. What analyzers are you using at
>> > index and query time?
>> >
>> > Two things that will help track down what's really happening:
>> >
>> > 1> query.toString() is your friend.
>> > 2> get a copy of the excellent Luke tool and have it do its explain
>> > magic on your query. Watch that the analyzer you choose when querying
>> > is what you expect....
>> >
>> > If neither of those things sheds any light on the problem, let us know
>> > what you find....
>> >
>> > Best
>> > Erick
>> >
>> > On Fri, Sep 26, 2008 at 3:55 PM, student_t <[EMAIL PROTECTED]> wrote:
>> >
>> >> I am baffled by the results of the following queries. Can it be
>> >> something to do with the boosting factor? All of these queries are
>> >> performed in the same environment with the same crawled index/data.
>> >>
>> >> A. query1 = +(content:(Pepsi))                    resulted in 228 hits.
>> >> B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 )  resulted in 398 hits.
>> >> C. query3 = +(host:(ca)^10 )                      resulted in 212 hits.
>> >>
>> >> Two questions (strictly just one):
>> >> 1. query1 (any content containing Pepsi) yielded 228 hits; how could a
>> >> more limiting query2 (give me all docs that have Pepsi in them with a
>> >> domain of ca) yield more hits (398)?
>> >> 2. Since there are 212 hits for Canadian domains, how can query2 return
>> >> 398 hits?
>> >>
>> >> Thanks for any pointers!
>> >> Cheers,
>> >> student_t

--
View this message in context: http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19725730.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]