Hello,

I noticed that summaries produced by Summarizer contain enormous amounts of whitespace, which probably comes from the HTML -> text parser (?). I propose collapsing it when producing summaries, i.e. leaving just a single space between tokens. The smaller summaries should give better performance when working with search results.
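
A minimal sketch of the kind of collapsing I have in mind, using only java.util.regex. The class and method names (SummaryWhitespace, collapse) are made up for illustration and are not part of Summarizer:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an existing Nutch class.
final class SummaryWhitespace {
    // Any run of whitespace (spaces, tabs, newlines) counts as one separator.
    private static final Pattern WS_RUN = Pattern.compile("\\s+");

    /** Collapses each whitespace run to a single space and trims both ends. */
    static String collapse(String text) {
        return WS_RUN.matcher(text).replaceAll(" ").trim();
    }
}
```

Note that by default \s does not match the non-breaking space (U+00A0), which HTML-derived text often contains, so a real version would probably need to handle that case separately.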

Regarding the query and scoring: I noticed that the "title" field is neither indexed nor tokenized, nor is it used in the query (see QueryTranslator). This strikes me as somewhat strange - quite often the title of a document contains relevant keywords. It would also be nice to be able to constrain the search by field, e.g. "url:www.ikea.se".
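
To illustrate the field constraint, here is a rough sketch of how a single query token could be split into a field and a term. Everything in it is an assumption for the sake of the example: the FieldClause class, the DEFAULT marker and the set of known field names are hypothetical, and this is not how QueryTranslator works today:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical value class for one parsed query token.
final class FieldClause {
    // Marker meaning "no explicit field given; search the default fields".
    static final String DEFAULT_FIELD = "DEFAULT";

    // Assumed list of searchable fields; the real set would come from the index.
    private static final Set<String> KNOWN_FIELDS =
        new HashSet<>(Arrays.asList("url", "title", "anchor"));

    final String field;
    final String term;

    private FieldClause(String field, String term) {
        this.field = field;
        this.term = term;
    }

    /** Splits "field:term" only when the prefix is a known field name. */
    static FieldClause parse(String token) {
        int colon = token.indexOf(':');
        if (colon > 0 && colon < token.length() - 1) {
            String prefix = token.substring(0, colon).toLowerCase(Locale.ROOT);
            if (KNOWN_FIELDS.contains(prefix)) {
                return new FieldClause(prefix, token.substring(colon + 1));
            }
        }
        // Unknown prefix (e.g. "http://...") or no colon: keep the token whole.
        return new FieldClause(DEFAULT_FIELD, token);
    }
}
```

Checking the prefix against a list of known fields keeps a token such as "http://www.ikea.se" from being read as a search in a field named "http".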

Also, QueryTranslator currently uses the same boost value for sloppy and exact phrases. My intuition suggests that exact phrases should get a higher boost (but my intuition may be wrong... :-) )

I also propose the following enhancement: initialize the scoring "knobs" from the config file. This would make it much easier to experiment with various settings - currently changing them requires recompilation. It would be nice to do the same in NutchSimilarity.
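
As a sketch of the idea, the knobs could be read once from a properties-style config, with the current hard-coded values as fallbacks. The class name ScoringKnobs, the two property keys and the 1.0 defaults are placeholders I made up for illustration; separate keys for exact and sloppy phrases would also make the boost question above easy to test:

```java
import java.util.Properties;

// Hypothetical holder for scoring settings; keys and defaults are assumptions.
final class ScoringKnobs {
    final float exactPhraseBoost;
    final float sloppyPhraseBoost;

    ScoringKnobs(Properties conf) {
        exactPhraseBoost = getFloat(conf, "query.phrase.boost.exact", 1.0f);
        sloppyPhraseBoost = getFloat(conf, "query.phrase.boost.sloppy", 1.0f);
    }

    /** Returns the configured value, or the fallback if missing or malformed. */
    static float getFloat(Properties conf, String key, float fallback) {
        String raw = conf.getProperty(key);
        if (raw == null) {
            return fallback;
        }
        try {
            return Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

With something like this, trying a different boost is an edit to the config file and a restart rather than a rebuild.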

Finally: from my observations it seems that the lengthNorm for "url" should be much lower, i.e. it should prefer shorter urls much more strongly. The reason is that currently, if you search for a url (e.g. "www.ikea.se"), the exact hit is quite often listed only at the n-th place instead of at the top.
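
To make the difference concrete, here is a small comparison, assuming the current behaviour resembles Lucene's default length normalization of 1/sqrt(numTokens). The steeper 1/numTokens variant is only one possible choice that I picked for illustration, and UrlLengthNorm is a made-up class, not the NutchSimilarity code:

```java
// Hypothetical comparison of two length normalizations for the "url" field.
final class UrlLengthNorm {
    /** Assumed current behaviour: 1 / sqrt(numTokens). */
    static float defaultNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(Math.max(1, numTokens)));
    }

    /** Illustrative steeper alternative: 1 / numTokens. */
    static float steepNorm(int numTokens) {
        return 1.0f / Math.max(1, numTokens);
    }
}
```

Under these assumptions, a url with 3 tokens beats one with 6 tokens by a factor of about 1.41 with the square-root norm, but by a factor of 2 with the steeper one, so the short exact match would be pushed further towards the top.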

Any comments appreciated.

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers