Thanks for that explanation, Chirag, that was what I was looking for. I use a pretty pimped-up segment and index: I'm not using Nutch for traditional webpages but for accessing other types of data. So I will still want to index all of the fields; it's just that a particular search should apply only to certain attributes of the data. I'll just "deboost" the fields I'm currently not interested in, that will work just fine :)

Big thanks,
Fredrik

On 8/3/05, Chirag Chaman <[EMAIL PROTECTED]> wrote:
> You should be able to do that by simply changing the boosts in the nutch
> properties file.
> Change your title boost to 3 or 4 and bring down all the other boosts to
> something less than 1.
>
> Re-indexing is not necessary. You only need to re-index if you want to
> change the boost in the norm field (NOTE: this boost is DIFFERENT from the
> query boost) which is encoded into the field and multiplied with the score --
> the query boost is then multiplied to this further.
>
> The only problem I see is that you don't want to index anything by content
> -- for that you will need to change the query to not look in that field or
> give that a very low boost as well (anything between 0 and 1 is a negative
> boost). AFAIK, to change the content part you will need to modify the query
> code.
>
> -----Original Message-----
> From: Fredrik Andersson [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 03, 2005 2:23 PM
> To: [email protected]
> Subject: Re: Strange search results
>
> While on the topic guys, if you require another weighting scheme than the
> default one, will a re-indexing really be necessary? I'm currently trying to
> search just some of the fields. For instance, I'd like to base the hits
> entirely on the page title, not by anchor text, contents or other factors. I
> thought this would be a matter of hacking the searcher-part of Nutch, not
> the index, but I haven't figured it out yet. Any wise words on this problem?
>
> Fredrik
>
> On 8/3/05, Howie Wang <[EMAIL PROTECTED]> wrote:
> > Thanks for the tips, Andy and Chirag! It saves me a lot of trouble.
> > I'll tweak the boosting for anchors and re-index and see where it gets
> > me.
> >
> > Thanks,
> > Howie
> >
> > >Concur with Andy on both points -- Unfortunately, there is no way to
> > >"go back" and remove either of these values without reindexing, so
> > >let me save you the trouble if you were thinking of changing the
> > >similarity class as a workaround.
> > >
> > >IMO, the problem with anchors is that you either need to get them
> > >all, or not get them at all -- getting just a few anchors can give
> > >you really bad results as stuff like "click here" will give pages a
> > >high score that don't contain either of these terms. Another
> > >approach is to go in the properties file and change the boost of
> > >anchors to 0.05, thus giving them a very very low boost
> > >
> > >Regarding the norm -- this is done at index time for each field.
> > >We've changed the indexing code so that it's always 1
> > >
> > >HTH,
> > >CC
> > >
> > >-----Original Message-----
> > >From: Andy Liu [mailto:[EMAIL PROTECTED]
> > >Sent: Wednesday, August 03, 2005 8:00 AM
> > >To: [email protected]
> > >Subject: Re: Strange search results
> > >
> > >The fieldNorm is lengthNorm * document boost. The final value is "rounded"
> > >so that's why you're getting such clean numbers for your fieldNorm.
> > >If you're finding that these pages have too high of a boost, you can
> > >lower indexer.score.power in your conf file.
> > >
> > >As for your problem in #2, look at the explain page to see how your
> > >search result got there. Maybe there's a high score for an anchor
> > >match. The anchor text doesn't show up on the text of the page, so maybe
> > >that's it.
> > >
> > >Andy
> > >
> > >On 8/3/05, Howie Wang <[EMAIL PROTECTED]> wrote:
> > > > Hi,
> > > >
> > > > I've been noticing some strange search results recently. I seem to
> > > > be getting two issues.
> > > >
> > > > 1. The fieldNorm for certain terms is unusually high for certain
> > > > sites for anchors and titles. And they are usually just whole
> > > > numbers (4.0, 5.0, etc).
> > > > I find this strange since the lengthNorm used to calculate this is
> > > > very unlikely to result in an integer. It's either
> > > > 1/sqrt(numTokens) or 1/log(e+numTokens). Where is 5.0 coming from?
> > > >
> > > > 2. I'm getting hits for sites that don't contain ANY of the terms
> > > > in my search. This is exacerbated by issue #1 since the fieldNorm
> > > > boosts this page to the top of the results. I thought it might be
> > > > because of my changes for stemming, but this happens for search
> > > > terms that are not changed by stemming at all.
> > > >
> > > > Anyone run into something like this? Any ideas on how to start
> > > > debugging?
> > > >
> > > > Thanks,
> > > > Howie

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
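
On Howie's question about where a fieldNorm of exactly 5.0 can come from: Andy's reply says the fieldNorm is lengthNorm * document boost and that the stored value is "rounded". The Python sketch below only illustrates that arithmetic. The quantizer is an assumption made for illustration (it keeps three mantissa bits and is not claimed to be Lucene's actual norm encoding), and the token count and document boost are made-up values.

```python
import math


def length_norm(num_tokens: int) -> float:
    """One of the lengthNorm forms mentioned in the thread: 1/sqrt(numTokens)."""
    return 1.0 / math.sqrt(num_tokens)


def field_norm(num_tokens: int, doc_boost: float) -> float:
    """fieldNorm = lengthNorm * document boost, before any lossy storage."""
    return length_norm(num_tokens) * doc_boost


def quantize(value: float, mantissa_bits: int = 3) -> float:
    """Hypothetical lossy storage: keep only a few mantissa bits.

    This is an illustrative stand-in for a compact norm encoding,
    not the encoding any particular Lucene version uses.
    """
    if value <= 0.0:
        return 0.0
    # value == mantissa * 2**exponent, with mantissa in [0.5, 1.0)
    mantissa, exponent = math.frexp(value)
    steps = 1 << (mantissa_bits + 1)  # leading bit plus the kept bits
    truncated = math.floor(mantissa * steps) / steps
    return math.ldexp(truncated, exponent)


if __name__ == "__main__":
    raw = field_norm(num_tokens=4, doc_boost=10.6)  # made-up inputs
    print("raw fieldNorm:   ", raw)
    print("stored fieldNorm:", quantize(raw))
```

Under these assumptions a raw value of about 5.3 is stored as 5.0, so a non-integer lengthNorm can still show up as a whole number once a large document boost and coarse rounding are applied.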

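
Chirag's point that query-time boosts can be changed without re-indexing can be illustrated with a toy scoring model. This is a sketch under simplifying assumptions, not Nutch's scoring code: each field contributes query boost * stored norm * match strength, and the field names, boost values, and documents are all hypothetical.

```python
from typing import Dict


def score(matches: Dict[str, float],
          norms: Dict[str, float],
          query_boosts: Dict[str, float]) -> float:
    """Sum per-field contributions: query boost * index-time norm * match.

    The norms are fixed at index time; the query boosts are plain
    parameters that can be changed for every search.
    """
    return sum(
        query_boosts.get(field, 1.0) * norms.get(field, 1.0) * strength
        for field, strength in matches.items()
    )


# Hypothetical index-time norms (all 1.0, as in Chirag's modified indexer).
NORMS = {"title": 1.0, "anchor": 1.0, "content": 1.0}

# Document A matches the query only in its title;
# document B matches in anchor text and content but not in the title.
DOC_A = {"title": 1.0}
DOC_B = {"anchor": 1.0, "content": 1.0}

# Two made-up boost configurations; the index is identical for both.
BALANCED = {"title": 1.5, "anchor": 2.0, "content": 1.0}
TITLE_ONLY = {"title": 4.0, "anchor": 0.05, "content": 0.05}

if __name__ == "__main__":
    for name, boosts in (("balanced", BALANCED), ("title-only", TITLE_ONLY)):
        print(name, "A:", score(DOC_A, NORMS, boosts),
              "B:", score(DOC_B, NORMS, boosts))
```

In this model the title-only configuration pushes document A far above document B without touching the stored norms, which is the "deboost" effect Fredrik describes. Fields with a small boost still contribute a little, so excluding a field entirely would need a change to the query itself, as Chirag notes.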