Thank you Erick!

I got Luke and it's a great tool! I verified from Luke my queries posted
originally worked as expected (i.e., "Canadian pepsi" produced fewer results
than "pepsi" along.)

Based on your suggestion, I found out the program re-wrote the query before
it was sent to Nutch as the following:

(content:"+(content:pepsi) +((host:ca^10.0)^10.0)")

I think this is messing up the query results. Literally, this is what the
code was doing:

Query nutchQuery = new Query();
nutchQuery.addRequiredTerm("+(content:pepsi) +((host:ca^10.0)^10.0)",
"content");
nutchBean.search(nutchQuery, maxHits, maxHitsPerSite);

The above code must be using a default analyzer, which I have not found out
yet. But after looking at Luke results and this code, I think my problem is
in the query construction. Could you shed some light about how the new query
be constructed based on the following String?

+(content:pepsi) +((host:ca^10.0)^10.0)

Thanks again!


Erick Erickson wrote:
> 
> For Luke, see http://www.getopt.org/luke/ or just google lucene luke
> 
> About the same analyzers at index and search. No, it's not *required*
> that you use the same analyzer, but unless and until you understand
> what analyzers do, I would *strongly* recommend that you use the
> same one. See PerFieldAnalyzerWrapper for using different analyzers
> for different fields....
> 
> Conceptually, think of an analyzer as giving you the "smallest meaningful
> token" from your stream that your program then does something with,
> either index or analyze.
> 
> Consider the line "this is the 2nd Line in Erick's program". What are the
> tokens?
> Well, depending on the analyzer you could get
> nd, line, erick, program (assuming your analyzer lowercases, strips
> numbers
> and punctuation and removes stopwords)
> 
> You could get
> "this is the 2nd Line in Erick's program" (assuming your analyzer did
> nothing, just returned the entire input stream as a token, which would
> then
> NOT get a hit searching, say, 'Line')
> 
> You could get
> this, is, the, 2nd, line, in erick's, program (assuming your analyzer just
> lowercased and split on whitespace)
> 
> You could get
> nonsense, nonsense, nonsense, nonsense (assuming you wrote a pathological
> analyzer that always returned the word "nonsense")
> 
> Really, any transformations you wanted to perform can be done in an
> analyzer.
> So using different analyzers during indexing and searching will very often
> produce "surprising" results. Don't do it unless you know exactly what's
> being
> done to your input streams. Or be prepared to spend a lot of hours
> tracking
> down "what went wrong" <G>.
> 
> All that said, it doesn't explain the behaviour you're seeing because
> you're
> right,
> adding successively more restrictions should produce smaller result
> counts.
> I suspect
> that if you see what actual queries you're generating (and when you get
> Luke, be
> sure to find the drop-down for "which analyzer" to use) you'll find
> something
> surprising.
> 
> Best
> Erick
> 
> 
> On Fri, Sep 26, 2008 at 6:57 PM, student_t <[EMAIL PROTECTED]> wrote:
> 
>>
>> Hi Eric,
>>
>> Thanks a bunch for your pointers. I will need to find out the analyzers
>> at
>> index and query time. But is it critical to have the same analyzers
>> during
>> these two times?
>>
>> I had tested with lucli from some of my local segment data and they
>> appeared
>> working fine (i.e., their result sets are reasonable.)
>>
>> Is Luke part of Lucene contrib? I recall there is a GUI that lets you
>> view
>> the indices. Would you please elaborate?
>>
>> Thanks again!
>> student_t
>>
>>
>> Erick Erickson wrote:
>> >
>> > That certainly doesn't look right. What analyzers are you using at
>> index
>> > and query time?
>> >
>> > Two things that will help track down what's really happening:
>> >
>> > 1> query.toString() is your friend.
>> > 2> get a copy of the excellent Luke tool and have it do its explain
>> magic
>> > on
>> > your query. Watch that the analyzer you choose when querying is what
>> you
>> > expect....
>> >
>> > If neither of those things sheds any light on the problem, let us know
>> > what
>> > you find....
>> >
>> > Best
>> > Erick
>> >
>> > On Fri, Sep 26, 2008 at 3:55 PM, student_t <[EMAIL PROTECTED]> wrote:
>> >
>> >>
>> >> I am baffled by the results of the following queries. Can it be
>> something
>> >> to
>> >> do with the boosting factor? All of these queries are performed in the
>> >> same
>> >> environment with the same crawled index/data.
>> >>
>> >> A. query1 = +(content:(Pepsi))                              resulted
>> in
>> >> 228
>> >> hits.
>> >> B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 )     resulted in 398
>> >> hits.
>> >> C. query3 = +(host:(ca)^10 )                                resulted
>> in
>> >> 212
>> >> hits.
>> >>
>> >> Two questions (strictly just one):
>> >> 1. query1 of any content contains Pepsi yielded 228 hits, how could a
>> >> more
>> >> limiting query2 (give me all docs that have Pepsi in it with a domain
>> of
>> >> ca)
>> >> yield more hits (398)?
>> >> 2. Since there are 212 hits of Canadian domains, how can query2 return
>> >> 398
>> >> hits?
>> >>
>> >> Thanks for any pointers!
>> >> Cheers,
>> >> student_t
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19695313.html
>> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> For additional commands, e-mail: [EMAIL PROTECTED]
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19697605.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19725730.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to