While we're on the topic: if you need a weighting scheme other than the default one, is re-indexing really necessary? I'm currently trying to search only some of the fields. For instance, I'd like to base the hits entirely on the page title, not on anchor text, content, or other factors. I thought this would be a matter of hacking the searcher part of Nutch, not the index, but I haven't figured it out yet. Any wise words on this problem?

Fredrik

On 8/3/05, Howie Wang <[EMAIL PROTECTED]> wrote:
> Thanks for the tips, Andy and Chirag! It saves me a lot of trouble.
> I'll tweak the boosting for anchors and re-index and see where it
> gets me.
>
> Thanks,
> Howie
>
>
> >Concur with Andy on both points -- Unfortunately, there is no way to "go
> >back" and remove either of these values without reindexing, so let me save
> >you the trouble if you were thinking of changing the similarity class as a
> >workaround.
> >
> >IMO, the problem with anchors is that you either need to get them all, or
> >not get them at all -- getting just a few anchors can give you really bad
> >results as stuff like "click here" will give pages a high score that don't
> >contain either of these terms. Another approach is to go in the properties
> >file and change the boost of anchors to 0.05, thus giving them a very very
> >low boost
> >
> >Regarding the norm -- this is done at index time for each field. We've
> >changed the indexing code so that it's always 1
> >
> >HTH,
> >CC
> >
> >
> >-----Original Message-----
> >From: Andy Liu [mailto:[EMAIL PROTECTED]
> >Sent: Wednesday, August 03, 2005 8:00 AM
> >To: [email protected]
> >Subject: Re: Strange search results
> >
> >The fieldNorm is lengthNorm * document boost. The final value is "rounded"
> >so that's why you're getting such clean numbers for your fieldNorm. If
> >you're finding that these pages have too high of a boost, you can lower
> >indexer.score.power in your conf file.
> >
> >As for your problem in #2, look at the explain page to see how your search
> >result got there. Maybe there's a high score for an anchor match. The
> >anchor text doesn't show up on the text of the page, so maybe that's it.
> >
> >Andy
> >
> >On 8/3/05, Howie Wang <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I've been noticing some strange search results recently. I seem to be
> > > getting two issues.
> > >
> > > 1. The fieldNorm for certain terms is unusually high for certain sites
> > > for anchors and titles. And they are usually just whole numbers (4.0,
> > > 5.0, etc).
> > > I find this strange since the lengthNorm used to calculate this is
> > > very unlikely to result in an integer. It's either 1/sqrt(numTokens)
> > > or 1/log(e+numTokens). Where is 5.0 coming from?
> > >
> > > 2. I'm getting hits for sites that don't contain ANY of the terms in
> > > my search. This is exacerbated by issue #1 since the fieldNorm boosts
> > > this page to the top of the results. I thought it might be because of
> > > my changes for stemming, but this happens for search terms that are
> > > not changed by stemming at all.
> > >
> > > Anyone run into something like this? Any ideas on how to start
> > > debugging?
> > >
> > > Thanks,
> > > Howie

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
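
On Andy's point in the quoted thread that the final fieldNorm is "rounded": below is a rough Python sketch of why whole numbers such as 4.0 and 5.0 can show up. It assumes the norm is squeezed into a single byte with a 3-bit mantissa and a 5-bit exponent, which is my understanding of how Lucene stores norms; the exact encoder in your version may differ. The function names and the sample boost of 10.6 are made up for illustration and are not Nutch or Lucene API.

```python
import math
import struct

# Assumed layout: 1 byte, 3-bit mantissa, 5-bit exponent, zero exponent at 15.
_OFFSET = (63 - 15) << 3


def encode_norm(f: float) -> int:
    """Pack a float into one byte by dropping low mantissa bits (lossy)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # raw float32 bits
    small = bits >> 21  # keep sign, exponent, and the top 2 mantissa bits
    if small <= _OFFSET:
        return 0 if bits <= 0 else 1  # zero/negative -> 0, underflow -> smallest
    if small >= _OFFSET + 0x100:
        return 255  # overflow -> largest representable value
    return small - _OFFSET


def decode_norm(b: int) -> float:
    """Expand the byte back into a float."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_norm(f: float) -> float:
    """The value the searcher would see after the encode/decode round trip."""
    return decode_norm(encode_norm(f))


def length_norm(num_tokens: int) -> float:
    """One of the two lengthNorm variants mentioned in the thread."""
    return 1.0 / math.sqrt(num_tokens)


if __name__ == "__main__":
    boost = 10.6                    # hypothetical document boost
    raw = length_norm(4) * boost    # 0.5 * 10.6 = 5.3
    print(raw, "->", stored_norm(raw))  # 5.3 -> 5.0
```

Under this assumption only four values per power of two survive the round trip (4.0, 5.0, 6.0, 7.0, 8.0, 10.0, ...), so a 4-token title with a large document boost comes back as exactly 5.0 even though lengthNorm itself is not an integer.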
