Thanks for the tips, Andy and Chirag! It saves me a lot of trouble.
I'll tweak the boosting for anchors and re-index and see where it
gets me.
Thanks,
Howie
Concur with Andy on both points. Unfortunately, there is no way to "go
back" and remove either of these values without reindexing, so let me save
you the trouble if you were thinking of changing the similarity class as a
workaround.
IMO, the problem with anchors is that you either need to get them all, or
not get them at all -- getting just a few anchors can give you really bad
results, as anchor text like "click here" will give a high score to pages
that don't contain either of these terms. Another approach is to go into
the properties file and change the boost of anchors to 0.05, thus giving
them a very low boost.
Regarding the norm -- this is done at index time for each field. We've
changed the indexing code so that it's always 1.
HTH,
CC
-----Original Message-----
From: Andy Liu [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 03, 2005 8:00 AM
To: [email protected]
Subject: Re: Strange search results
The fieldNorm is lengthNorm * document boost. The final value is "rounded",
so that's why you're getting such clean numbers for your fieldNorm. If
you're finding that these pages have too high a boost, you can lower
indexer.score.power in your conf file.
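
To make the arithmetic above concrete, here is a rough Python sketch of how
a whole-number fieldNorm such as 5.0 could come out of lengthNorm * document
boost. It is not the actual Nutch or Lucene code. Two things in it are
assumptions: that the document boost is the page's link score raised to
indexer.score.power, and that the "rounding" behaves like keeping only the
top few bits of a 32-bit float (the helper store_coarsely is a hypothetical
stand-in for the index's compact norm storage, not a real API).

```python
import math
import struct


def length_norm(num_tokens: int) -> float:
    """lengthNorm as 1/sqrt(numTokens), one of the two forms quoted in the thread."""
    return 1.0 / math.sqrt(num_tokens)


def doc_boost(link_score: float, score_power: float) -> float:
    """ASSUMPTION: boost = link_score ** indexer.score.power."""
    return link_score ** score_power


def store_coarsely(value: float) -> float:
    """ASSUMPTION: hypothetical stand-in for compact norm storage.

    Keeps the float32 sign, exponent and top two fraction bits and
    drops everything below them, so only values such as 1, 1.25, 1.5,
    1.75, 2, 2.5, 3, 3.5, 4, 5, 6, 7, 8, ... survive.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    bits &= ~((1 << 21) - 1) & 0xFFFFFFFF
    (coarse,) = struct.unpack(">f", struct.pack(">I", bits))
    return coarse


def field_norm(num_tokens: int, link_score: float, score_power: float) -> float:
    """fieldNorm = lengthNorm * document boost, then stored coarsely."""
    raw = length_norm(num_tokens) * doc_boost(link_score, score_power)
    return store_coarsely(raw)


if __name__ == "__main__":
    # A 4-token anchor field on a page with a made-up link score of 121:
    # raw value is 0.5 * 11 = 5.5, which is stored as 5.0.
    print(field_norm(4, 121.0, 0.5))   # 5.0
    # Lowering the power shrinks the boost: 0.5 * 121**0.25 is about 1.66,
    # which is stored as 1.5.
    print(field_norm(4, 121.0, 0.25))  # 1.5
```

Note that lengthNorm alone never exceeds 1 for a field with at least one
token, so under these assumptions a fieldNorm of 4.0 or 5.0 has to come from
the boost factor, and the clean value from the coarse storage step.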
As for your problem in #2, look at the explain page to see how your search
result got there. Maybe there's a high score for an anchor match. Anchor
text comes from links on other pages and doesn't show up in the text of the
page itself, so maybe that's it.
Andy
On 8/3/05, Howie Wang <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I've been noticing some strange search results recently. I seem to be
> getting two issues.
>
> 1. The fieldNorm for certain terms is unusually high for certain sites
> for anchors and titles. And they are usually just whole numbers (4.0,
> 5.0, etc).
> I find this strange since the lengthNorm used to calculate this is
> very unlikely to result in an integer. It's either 1/sqrt(numTokens)
> or 1/log(e+numTokens). Where is 5.0 coming from?
>
> 2. I'm getting hits for sites that don't contain ANY of the terms in
> my search. This is exacerbated by issue #1 since the fieldNorm boosts
> this page to the top of the results. I thought it might be because of
> my changes for stemming, but this happens for search terms that are
> not changed by stemming at all.
>
> Anyone run into something like this? Any ideas on how to start
> debugging?
>
> Thanks,
> Howie
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers