Andrzej Bialecki wrote:
* the DefaultSimilarity seems to excessively favor small lengths of "content" (high tf) and anchor texts (too high boost value?).

NutchSimilarity.lengthNorm() penalizes short content by normalizing every document with fewer than 1000 content tokens as though it had 1000 content tokens. Is this not sufficient?


Note that anchor boosting is confounded with lengthNorm() for anchors, which uses 1/log(#tokens) rather than 1/sqrt(#tokens) in order to favor pages with more anchors more strongly.
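
To make the two normalizations concrete, here is a rough, standalone sketch of the arithmetic described above. It is not the actual NutchSimilarity source: the class and method names are made up for illustration, and adding e inside the logarithm is my own assumption so that a field with zero or one token does not divide by zero.

```java
// Illustrative sketch only; names are hypothetical, not the Nutch API.
final class LengthNormSketch {

    // Floor described in the message: shorter content is treated as this long.
    static final int MIN_CONTENT_TOKENS = 1000;

    // Content fields: 1/sqrt(#tokens), with #tokens floored at 1000.
    static float contentLengthNorm(int numTokens) {
        int effective = Math.max(numTokens, MIN_CONTENT_TOKENS);
        return (float) (1.0 / Math.sqrt(effective));
    }

    // Anchor fields: 1/log(#tokens). The "e +" offset is an assumption
    // made here to keep the value finite for 0 or 1 tokens.
    static float anchorLengthNorm(int numTokens) {
        return (float) (1.0 / Math.log(Math.E + numTokens));
    }

    // Plain 1/sqrt(#tokens) with no floor, for comparison.
    static float sqrtLengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(Math.max(numTokens, 1)));
    }

    public static void main(String[] args) {
        int[] sizes = {10, 100, 1000, 10000};
        for (int n : sizes) {
            System.out.println(n + " tokens: content=" + contentLengthNorm(n)
                    + " anchor=" + anchorLengthNorm(n)
                    + " sqrt=" + sqrtLengthNorm(n));
        }
    }
}
```

With the floor, a 10-token page and a 1000-token page get the same content norm, so very short content gains nothing from its high tf. The logarithm decays much more slowly than the square root, so a page with many anchor tokens keeps most of its norm.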

* title is not indexed nor tokenized, but quite often contains query terms. Currently the title is treated as one of the anchors. IMHO the title is more important, and it should be made into a separate indexed and tokenized field and the default query translator (BasicQueryFilter) should take this into account.

I don't object to indexing titles in a separate field. They can be high quality, but they can also be spammed more easily than anchors. In any case, separately controlling their boost, length normalization, etc. is probably a good idea.


* for the url field it's not the same whether the query terms occur in the domain name, or in the file path name in the url. The former is usually more important, because it's more likely to point to a reference site, and IMHO should be boosted separately. The latter usually indicates a reference page. We could differentiate between the two by adding a "domain" field as an unstored, tokenized and indexed field, and by modifying the BasicQueryFilter accordingly to use this field in order to boost up reference sites.

I'm not exactly sure how you'd use this. Why not just boost pages at reference sites? That does not require a new field.
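
For concreteness, the domain/path split in the proposal could be sketched roughly as below. This is only an illustration using java.net.URI: the class and method names are hypothetical, the tokenization rule (lowercase, split on non-alphanumerics) is an assumption, and nothing here touches the actual indexing or BasicQueryFilter code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; names are hypothetical, not the Nutch API.
final class UrlFieldSplitSketch {

    // Lowercase and split on runs of non-alphanumeric characters (assumed rule).
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        if (s == null) {
            return tokens;
        }
        for (String t : s.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Tokens that would go into the proposed separate "domain" field.
    static List<String> domainTokens(String url) {
        try {
            return tokenize(new URI(url).getHost());
        } catch (URISyntaxException e) {
            return new ArrayList<>(); // unparsable URL: no domain tokens
        }
    }

    // Tokens from the file path part of the URL.
    static List<String> pathTokens(String url) {
        try {
            return tokenize(new URI(url).getPath());
        } catch (URISyntaxException e) {
            return new ArrayList<>(); // unparsable URL: no path tokens
        }
    }

    public static void main(String[] args) {
        String url = "http://www.apache.org/lucene/index.html";
        System.out.println("domain: " + domainTokens(url));
        System.out.println("path:   " + pathTokens(url));
    }
}
```

A query term matching the domain tokens could then be boosted differently from one matching only the path tokens.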


Also, to offer more flexibility in searching I would propose to index the values of primaryType and secondaryType. This would enable searching for content of specific mime type. Currently these fields are only stored, but not indexed.

I think John recently added a plugin that does this, right?

Doug


Nutch-developers mailing list https://lists.sourceforge.net/lists/listinfo/nutch-developers
