Re: Please help to interpret Lucene Boost results

Erick Erickson Sat, 27 Sep 2008 07:43:42 -0700

For Luke, see http://www.getopt.org/luke/ or just google lucene luke

About using the same analyzer at index and search time. No, it's not *required*
that you use the same analyzer, but unless and until you understand
what analyzers do, I would *strongly* recommend that you use the
same one. See PerFieldAnalyzerWrapper for using different analyzers
for different fields....

Conceptually, think of an analyzer as giving you the "smallest meaningful
token" from your stream that your program then does something with,
either index or search.

Consider the line "this is the 2nd Line in Erick's program". What are the
tokens? Well, depending on the analyzer you could get
nd, line, erick, program (assuming your analyzer lowercases, strips numbers
and punctuation and removes stopwords)

You could get
"this is the 2nd Line in Erick's program" (assuming your analyzer did
nothing, just returned the entire input stream as a token, which would then
NOT get a hit searching, say, 'Line')

You could get
this, is, the, 2nd, line, in, erick's, program (assuming your analyzer just
lowercased and split on whitespace)

You could get
nonsense, nonsense, nonsense, nonsense (assuming you wrote a pathological
analyzer that always returned the word "nonsense")
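
To make the four cases concrete, here is a minimal sketch in plain Python.
These are toy functions, not Lucene analyzers; the function names, the
stopword list and the handling of the possessive 's are all assumptions
picked so the output lines up with the token lists above.

```python
import re

LINE = "this is the 2nd Line in Erick's program"

# Toy stopword list (an assumption, not Lucene's default set).
STOPWORDS = {"this", "is", "the", "in"}

def stripping_analyzer(text):
    """Lowercase, drop possessive 's, keep only runs of letters, drop stopwords."""
    lowered = re.sub(r"'s\b", "", text.lower())  # assumption: possessives are dropped
    tokens = re.findall(r"[a-z]+", lowered)      # digits and punctuation vanish here
    return [t for t in tokens if t not in STOPWORDS]

def keyword_analyzer(text):
    """Do nothing: the whole input comes back as a single token."""
    return [text]

def whitespace_lower_analyzer(text):
    """Lowercase and split on whitespace only."""
    return text.lower().split()

def nonsense_analyzer(text):
    """Pathological: every whitespace-separated piece becomes 'nonsense'."""
    return ["nonsense" for _ in text.split()]

for analyzer in (stripping_analyzer, keyword_analyzer,
                 whitespace_lower_analyzer, nonsense_analyzer):
    print(analyzer.__name__, analyzer(LINE))
```

Same input line, four different token streams, and only the tokens ever
reach the index or the query.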

Really, any transformation you want to perform can be done in an analyzer.
So using different analyzers during indexing and searching will very often
produce "surprising" results. Don't do it unless you know exactly what's
being done to your input streams. Or be prepared to spend a lot of hours
tracking down "what went wrong" <G>.
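
A rough illustration of the kind of surprise I mean, again as a toy model
in plain Python rather than real Lucene code. The tiny inverted index and
the build_index/search helpers are hypothetical, made up for this sketch.

```python
from collections import defaultdict

def keyword_analyzer(text):
    """Whole input as one token, case untouched."""
    return [text]

def whitespace_lower_analyzer(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()

def build_index(docs, analyzer):
    """Map each token the analyzer emits to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index

def search(index, query_text, analyzer):
    """Doc ids that contain every token the analyzer produces for the query."""
    tokens = analyzer(query_text)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = {1: "this is the 2nd Line in Erick's program"}
index = build_index(docs, whitespace_lower_analyzer)

# Same analyzer on both sides: the query 'Line' becomes 'line' and matches.
same = search(index, "Line", whitespace_lower_analyzer)

# Different analyzer at search time: the query stays 'Line', which was
# never indexed, so nothing matches.
mismatched = search(index, "Line", keyword_analyzer)

print(same, mismatched)
```

The document and the query text are identical in both searches. Only the
analyzer differs, and that alone decides whether there is a hit.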

All that said, it doesn't explain the behaviour you're seeing because you're
right, adding successively more restrictions should produce smaller result
counts. I suspect that if you look at the actual queries you're generating
(and when you get Luke, be sure to find the drop-down for "which analyzer"
to use) you'll find something surprising.
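
For what it's worth, here is the arithmetic behind "more restrictions
should mean fewer hits", as a set model in plain Python. The doc ids are
synthetic, NOT your data: the set sizes match your reported counts, and
the overlap of 42 is an assumption picked so that the union comes out to
398. That only shows 398 is the sort of number you'd get if the two
clauses were being ORed instead of both required. It is one possibility
to check against query.toString(), not a diagnosis. A boost like ^10
changes scores, not which documents match, so it is left out here.

```python
# Synthetic doc ids, NOT real data: sizes match the reported hit counts,
# and the overlap of 42 is an assumption picked so the union is 398.
pepsi_docs = set(range(0, 228))   # 228 docs standing in for content:Pepsi
ca_docs = set(range(186, 398))    # 212 docs standing in for host:ca

def required_all(*clauses):
    """Every clause required (+a +b): intersection of the match sets."""
    result = set(clauses[0])
    for clause in clauses[1:]:
        result &= clause
    return result

def any_optional(*clauses):
    """Every clause optional (a b): union of the match sets."""
    result = set()
    for clause in clauses:
        result |= clause
    return result

and_hits = len(required_all(pepsi_docs, ca_docs))
or_hits = len(any_optional(pepsi_docs, ca_docs))

print("both required:", and_hits)
print("both optional:", or_hits)
```

With both clauses required the count can never exceed the smaller of the
two sets. A count above both of them only makes sense if the query that
actually ran was not the one you think you wrote.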

Best
Erick

On Fri, Sep 26, 2008 at 6:57 PM, student_t <[EMAIL PROTECTED]> wrote:

>
> Hi Eric,
>
> Thanks a bunch for your pointers. I will need to find out the analyzers at
> index and query time. But is it critical to have the same analyzers during
> these two times?
>
> I had tested with lucli on some of my local segment data and it appeared
> to work fine (i.e., the result sets are reasonable).
>
> Is Luke part of Lucene contrib? I recall there is a GUI that lets you view
> the indices. Would you please elaborate?
>
> Thanks again!
> student_t
>
>
> Erick Erickson wrote:
> >
> > That certainly doesn't look right. What analyzers are you using at index
> > and query time?
> >
> > Two things that will help track down what's really happening:
> >
> > 1> query.toString() is your friend.
> > 2> get a copy of the excellent Luke tool and have it do its explain magic
> > on your query. Watch that the analyzer you choose when querying is what
> > you expect....
> >
> > If neither of those things sheds any light on the problem, let us know
> > what you find....
> >
> > Best
> > Erick
> >
> > On Fri, Sep 26, 2008 at 3:55 PM, student_t <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> I am baffled by the results of the following queries. Can it be something
> >> to do with the boosting factor? All of these queries are performed in the
> >> same environment with the same crawled index/data.
> >>
> >> A. query1 = +(content:(Pepsi))                    resulted in 228 hits.
> >> B. query2 = +(content:(Pepsi) ) +(host:(ca)^10 )  resulted in 398 hits.
> >> C. query3 = +(host:(ca)^10 )                      resulted in 212 hits.
> >>
> >> Two questions (strictly just one):
> >> 1. query1 (any content containing Pepsi) yielded 228 hits; how could a
> >> more limiting query2 (give me all docs that have Pepsi in them with a
> >> domain of ca) yield more hits (398)?
> >> 2. Since there are only 212 hits for Canadian domains, how can query2
> >> return 398 hits?
> >>
> >> Thanks for any pointers!
> >> Cheers,
> >> student_t
> >>
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19695313.html
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Please-help-to-interpret-Lucene-Boost-results-tp19695313p19697605.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
