Doug Cutting wrote:
NutchSimilarity.lengthNorm() penalizes short content by normalizing all documents with fewer than 1000 content tokens as though they had 1000 content tokens. Is this not sufficient?
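For readers unfamiliar with the mechanism: classic Lucene length normalization is 1/sqrt(numTerms), and the behavior described above amounts to flooring the token count at 1000. A minimal sketch (not the actual NutchSimilarity source; the constant name is an assumption):

```java
public class LengthNormSketch {
    // Assumed threshold from the discussion: pages shorter than this are
    // normalized as though they had exactly this many content tokens.
    static final int MIN_CONTENT_TOKENS = 1000;

    // Classic Lucene lengthNorm is 1/sqrt(numTerms); here short documents
    // are floored so they gain no extra boost from being short.
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(Math.max(numTokens, MIN_CONTENT_TOKENS)));
    }

    public static void main(String[] args) {
        // A 10-token page and a 1000-token page get the same norm...
        System.out.println(lengthNorm(10) == lengthNorm(1000));
        // ...but a 4000-token page is still penalized relative to both.
        System.out.println(lengthNorm(4000) < lengthNorm(1000));
    }
}
```

So the floor only removes the advantage of very short pages; it does not stop a short spammy page from outscoring longer, better ones on other factors, which is the issue below.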
Not in my experience. Please consider the following hits (attached in a file), ordered by score, which I got from a 5-million-page index of mostly Swedish sites for the query "apoteket" ("the pharmacy" in Swedish). There is clearly something very wrong with the second hit.
Yes. If that were scored as a "title" match (which it really is), and titles were boosted less than anchors, then it would probably rank third or lower.
I don't object to indexing titles in a separate field. They can be high quality, but they can also be spammed more easily than anchors. In any case, separately controlling their boost, length normalization, etc. is probably a good idea.
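The point about separate control can be illustrated with a toy scoring sketch (field names and boost values are assumptions, not Nutch code): once "title" is its own indexed field, its contribution can be weighted independently of anchors.

```java
public class FieldBoostSketch {
    // Hedged sketch: combine per-field match scores linearly, each with its
    // own boost, so titles can be trusted less than anchors.
    static float score(float titleScore, float anchorScore,
                       float titleBoost, float anchorBoost) {
        return titleScore * titleBoost + anchorScore * anchorBoost;
    }

    public static void main(String[] args) {
        // Example boosts: titles at 1.0, anchors at 2.0 (illustrative only).
        float pureTitleHit  = score(1.0f, 0.0f, 1.0f, 2.0f);
        float pureAnchorHit = score(0.0f, 1.0f, 1.0f, 2.0f);
        // With these boosts, an anchor-only match outranks a title-only match.
        System.out.println(pureTitleHit < pureAnchorHit);
    }
}
```

When title and content share one field, no such per-field weighting (or per-field length normalization) is possible, which is the motivation for the patch.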
Ok, I'll prepare a patch for review.
Great! I'm glad more folks are looking at search result quality. This is very important, and not simple.
Example: all other things being equal (i.e. the content and anchors), which url seems to be more representative for the query "ikea":
http://www.ikea.se/something/else.html
http://www.something.se/else/ikea.html
IMHO the first url should be given a higher score. Currently they get the same score.
Agreed. This argues for "host" as a separate indexed field.
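A quick sketch of what a separate "host" field buys (the helper name is hypothetical; this is just standard java.net.URI host extraction, not Nutch code):

```java
import java.net.URI;

public class HostFieldSketch {
    // Hypothetical helper: extract the host so it can be indexed as its
    // own field, distinct from the rest of the URL.
    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        // Only the first URL carries "ikea" in its host; with a separately
        // indexed and boosted "host" field, it can score higher for "ikea".
        System.out.println(hostOf("http://www.ikea.se/something/else.html"));
        System.out.println(hostOf("http://www.something.se/else/ikea.html"));
        System.out.println(
            hostOf("http://www.ikea.se/something/else.html").contains("ikea"));
    }
}
```

Indexing the whole URL as one field cannot make this distinction, since both URLs contain the token "ikea" somewhere.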
Doug
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
