Andrzej Bialecki wrote:
I noticed that summaries produced by Summarizer contain enormous amounts of whitespace. It probably comes from the HTML -> text parser (?). I propose that it should be trimmed out when producing summaries, i.e. there should be just a single space left between tokens. This should result in better performance when working with search results.

I agree, that would be a great enhancement. I think it would be best to trim the whitespace when the text is first extracted. Please feel free to submit a patch.
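As a rough illustration of the trimming discussed above, the following is a minimal sketch and not the actual Nutch parser code. The class and method names (WhitespaceCollapser, collapse) are made up for this example, and it assumes the extracted text is available as a plain String.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collapses whitespace in extracted text.
class WhitespaceCollapser {
    // Matches one or more whitespace characters (space, tab, newline, etc.).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    /** Replaces each run of whitespace with one space and trims both ends. */
    static String collapse(String text) {
        return WHITESPACE_RUN.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String extracted = "  Nutch \n\n\t is   a  search\r\n engine  ";
        System.out.println("[" + collapse(extracted) + "]");
    }
}
```

Note that Java's default \s class does not match the non-breaking space (U+00A0), which HTML pages often contain, so a real implementation would probably need to handle that character separately.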


Regarding the query and scoring: I noticed that the "title" field is not indexed and not tokenized, nor is it used in the query (see QueryTranslator). This strikes me as somewhat strange - quite often the title of a document contains relevant keywords. It would also be nice to have the ability to constrain the search by field, e.g. "url:www.ikea.se".

The title is in fact indexed, but in the anchor field. This was probably a premature optimization. I felt that searching another field was more expensive than the title warranted. When I have a chance I intend to fix this. This will probably happen when I do a major revision of the query translation and indexing code, which will probably be in the next few weeks.


Also, in the QueryTranslator currently the same boost value is used for sloppy and exact phrases. My intuition suggests that exact phrases should get a higher boost (but my intuition may be wrong... :-) )

The place to modify that would be to add a NutchSimilarity.sloppyFreq(int slop) method that gives a significantly larger value when slop=0. More generally, this is part of the Nutch parameter tuning, which is in its infancy. All of the Similarity methods need to be tuned.
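To make the shape of such a method concrete, here is a standalone sketch. It deliberately does not extend Lucene's Similarity class so that it runs on its own; the class name, the 1/(slop + 1) falloff, and the EXACT_MATCH_BOOST value of 2.0 are all placeholders chosen for illustration, not tuned or proposed values.

```java
// Hypothetical standalone sketch of a sloppyFreq that favors exact phrases.
class SloppyFreqSketch {
    // Placeholder knob: extra weight applied only when slop == 0.
    static final float EXACT_MATCH_BOOST = 2.0f;

    /** Returns a weight for a phrase match found at the given slop. */
    static float sloppyFreq(int slop) {
        float base = 1.0f / (slop + 1);  // weight decays as the match gets sloppier
        return slop == 0 ? EXACT_MATCH_BOOST * base : base;
    }

    public static void main(String[] args) {
        for (int slop = 0; slop <= 3; slop++) {
            System.out.println("slop=" + slop + " -> " + sloppyFreq(slop));
        }
    }
}
```

Whether a jump like this at slop=0 actually improves ranking is exactly the kind of thing that would need tuning against real queries.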


I also propose the following enhancement: that the scoring "knobs" be initialized from the config file. This way it will be much easier to experiment with various settings - currently re-compilation is required to change them. It would be nice to have it the same way also in NutchSimilarity.

I agree. Feel free to submit patches. I try to move parameters into config files whenever I can, but there are still lots more to go.


On the subject of config files: my current goal is to move the regex url filter config from a separate file into the main config file, as a CDATA value. Long-term, all options should use the xml config mechanism, with default values in nutch-default.xml, site-specific overrides in nutch-site.xml, with, in some cases, application-specific overrides between (e.g., crawl-tool.xml).
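The override order described above (defaults, then application-specific, then site-specific) can be sketched with java.util.Properties, whose constructor accepts a fallback set of defaults. This is only an illustration of the lookup order and not the Nutch config code: it does not parse XML, and the property names used here are invented for the example.

```java
import java.util.Properties;

// Hypothetical sketch of layered config lookup; not the Nutch implementation.
class LayeredConfigSketch {
    /** Site values win over application values, which win over defaults. */
    static Properties layer(Properties defaults, Properties appOverrides,
                            Properties siteOverrides) {
        Properties app = new Properties(defaults);  // falls back to defaults
        app.putAll(appOverrides);
        Properties site = new Properties(app);      // falls back to app layer
        site.putAll(siteOverrides);
        return site;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();     // stands in for nutch-default.xml
        defaults.setProperty("example.phrase.boost", "1.0");  // invented name
        defaults.setProperty("example.fetch.depth", "3");     // invented name

        Properties app = new Properties();          // stands in for crawl-tool.xml
        app.setProperty("example.fetch.depth", "5");

        Properties site = new Properties();         // stands in for nutch-site.xml
        site.setProperty("example.phrase.boost", "2.5");

        Properties conf = layer(defaults, app, site);
        System.out.println(conf.getProperty("example.phrase.boost"));
        System.out.println(conf.getProperty("example.fetch.depth"));
    }
}
```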

Finally: from my observations it seems that the lengthNorm for "url" should be much lower, i.e. it should much more strongly prefer shorter urls. The reason for this is that currently quite often if you search for a url (e.g. "www.ikea.se") the exact hit is listed only at n-th place, instead of the top.

Have you performed experiments that show this to be the issue with these rankings? I've worked a bit on this, and find that in most cases there's something else wrong (link-analysis gone amok or somesuch). If you have a replacement definition for URL lengthNorm, please send it along, ideally along with some "explain" outputs illustrating its merits.
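For discussion purposes, a replacement definition might look something like the sketch below. It compares a 1/sqrt(numTokens) norm with a steeper 1/numTokens alternative; both functions and the class name are illustrative, the steeper one is just one hypothetical candidate, and the token counts assume the url is split on punctuation. Nothing here shows that the change would fix the rankings in question; that would still need "explain" output from real queries.

```java
// Hypothetical comparison of two length norms for the "url" field.
class UrlLengthNormSketch {
    /** Norm that shrinks with the square root of the token count. */
    static float sqrtNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    /** Steeper candidate: penalizes long urls more heavily. */
    static float steeperUrlNorm(int numTokens) {
        return 1.0f / numTokens;
    }

    public static void main(String[] args) {
        // Assumed token counts: "www.ikea.se" as 3 tokens, a deep url as 9.
        int[] tokenCounts = {3, 9};
        for (int n : tokenCounts) {
            System.out.println("tokens=" + n
                + " sqrtNorm=" + sqrtNorm(n)
                + " steeperUrlNorm=" + steeperUrlNorm(n));
        }
    }
}
```

Under the steeper norm the short url's advantage over the 9-token url grows from a factor of sqrt(3) to a factor of 3, which is the effect the proposal is after.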


Cheers,

Doug


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers