Dawid Weiss wrote:
> It seems to me that there are two separate problems:
> 1) content parsing that filters out site structure -> influences the
> index and rankings
> 2) content parsing for KWIC snippet generation -> influences the
> user's perception of the engine.
> I'd agree that (2) is quite important for the end user; Richard's
> continuous-text heuristic may actually work for that. I'd extend the
> meaning of "continuous block" to ignore inline tags such as SPAN, I,
> B, TT etc., so that only certain tags would actually break the content
> into chunks. Snippets would then be generated from these chunks alone,
> ignoring the rest of the content. If this heuristic is applied only at
> snippet-generation time, then Andrzej's concern about missing content
> is no longer relevant.
Hmm... I'm not convinced. How would you generate the best snippet when
the relevant text sits in a chunk that the heuristic has ignored?

But I agree that for some (perhaps large) percentage of sites this
heuristic could work well, and it's simple enough to implement.
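
Just to make the idea concrete, here's a rough, untested sketch of how
such a chunker could look (plain Java, not the actual Nutch parser code;
the inline-tag set, the regex tag scan and the term-counting snippet
scorer are placeholders chosen for illustration only):

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of the "continuous block" heuristic: inline tags (SPAN, I, B,
 * TT, ...) do not break the text, any other tag ends the current chunk.
 * Snippets are then picked from the resulting chunks only.
 */
public class ChunkSnippets {

  // Inline tags that should NOT break a continuous text block
  // (illustrative set, not exhaustive).
  private static final Set<String> INLINE_TAGS =
      Set.of("span", "i", "b", "tt", "em", "strong", "a", "u", "small");

  private static final Pattern TAG =
      Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

  /** Split raw HTML into continuous text chunks. */
  public static List<String> chunks(String html) {
    List<String> result = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    Matcher m = TAG.matcher(html);
    int last = 0;
    while (m.find()) {
      current.append(html, last, m.start());
      String name = m.group(1).toLowerCase(Locale.ROOT);
      if (!INLINE_TAGS.contains(name)) {
        // Any non-inline (block-level or unknown) tag ends the chunk.
        flush(current, result);
      }
      last = m.end();
    }
    current.append(html.substring(last));
    flush(current, result);
    return result;
  }

  private static void flush(StringBuilder current, List<String> out) {
    String text = current.toString().replaceAll("\\s+", " ").trim();
    if (!text.isEmpty()) {
      out.add(text);
    }
    current.setLength(0);
  }

  /** Pick the chunk with the most query-term hits as the snippet. */
  public static String bestSnippet(List<String> chunks, Set<String> queryTerms) {
    String best = "";
    int bestScore = 0;
    for (String chunk : chunks) {
      String lower = chunk.toLowerCase(Locale.ROOT);
      int score = 0;
      for (String term : queryTerms) {
        if (lower.contains(term.toLowerCase(Locale.ROOT))) {
          score++;
        }
      }
      if (score > bestScore) {
        bestScore = score;
        best = chunk;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    String html = "<div>Nav: <a href=\"/\">home</a></div>"
        + "<p>Nutch is an <b>open source</b> web search engine "
        + "built on Lucene.</p>";
    List<String> c = chunks(html);
    System.out.println(c);
    System.out.println(bestSnippet(c, Set.of("nutch", "search")));
  }
}

Of course this only answers the question above when the relevant text
actually ends up in one of the surviving chunks, which is exactly the
case I'm not sure about.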
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com