Dawid Weiss wrote:
> It seems to me that there are two separate problems:
> 1) content parsing that filters out site structure -> influences the
> index and rankings
> 2) content parsing for KWIC snippet generation -> influences the
> user's perception of the engine.
> I'd agree that (2) is quite important for the end user; Richard's
> continuous-text heuristic may actually work for that. I'd extend the
> meaning of "continuous block" to ignore inline tags such as SPAN, I,
> B, TT etc., so that only certain tags would actually break the content
> into chunks. Snippets would then be generated from these chunks alone,
> ignoring the rest of the content. If this heuristic is applied only at
> snippet-generation time, then Andrzej's concern about missing content
> is no longer relevant.
Hmm... I'm not convinced. How would you generate the best snippet when
the relevant text sits in a chunk that the heuristic has ignored?

But I agree that for some (perhaps large) percentage of sites this
heuristic could work well, and it's simple enough to implement.
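
Just to make the idea concrete, here's a rough, untested sketch of how
such a chunker could look (plain Java, not the actual Nutch parser code;
the inline-tag set, the regex tag scan and the term-counting snippet
scorer are placeholders chosen for illustration only):

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of the "continuous block" heuristic: inline tags (SPAN, I, B,
 * TT, ...) do not break the text, any other tag ends the current chunk.
 * Snippets are then picked from the resulting chunks only.
 */
public class ChunkSnippets {

  // Inline tags that should NOT break a continuous text block
  // (illustrative set, not exhaustive).
  private static final Set<String> INLINE_TAGS =
      Set.of("span", "i", "b", "tt", "em", "strong", "a", "u", "small");

  private static final Pattern TAG =
      Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

  /** Split raw HTML into continuous text chunks. */
  public static List<String> chunks(String html) {
    List<String> result = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    Matcher m = TAG.matcher(html);
    int last = 0;
    while (m.find()) {
      current.append(html, last, m.start());
      String name = m.group(1).toLowerCase(Locale.ROOT);
      if (!INLINE_TAGS.contains(name)) {
        // Any non-inline (block-level or unknown) tag ends the chunk.
        flush(current, result);
      }
      last = m.end();
    }
    current.append(html.substring(last));
    flush(current, result);
    return result;
  }

  private static void flush(StringBuilder current, List<String> out) {
    String text = current.toString().replaceAll("\\s+", " ").trim();
    if (!text.isEmpty()) {
      out.add(text);
    }
    current.setLength(0);
  }

  /** Pick the chunk with the most query-term hits as the snippet. */
  public static String bestSnippet(List<String> chunks, Set<String> queryTerms) {
    String best = "";
    int bestScore = 0;
    for (String chunk : chunks) {
      String lower = chunk.toLowerCase(Locale.ROOT);
      int score = 0;
      for (String term : queryTerms) {
        if (lower.contains(term.toLowerCase(Locale.ROOT))) {
          score++;
        }
      }
      if (score > bestScore) {
        bestScore = score;
        best = chunk;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    String html = "<div>Nav: <a href=\"/\">home</a></div>"
        + "<p>Nutch is an <b>open source</b> web search engine "
        + "built on Lucene.</p>";
    List<String> c = chunks(html);
    System.out.println(c);
    System.out.println(bestSnippet(c, Set.of("nutch", "search")));
  }
}

Of course this only answers the question above when the relevant text
actually ends up in one of the surviving chunks, which is exactly the
case I'm not sure about.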
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com