I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of "continuous block" to ignore inline tags such as SPAN, I, B, TT, etc., so that only certain tags would actually break the content into chunks. Snippets would then be generated from these chunks alone, ignoring the rest of the content. If this heuristic is applied only at snippet-generation time, then Andrzej's concern about missing content is no longer relevant.
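
To make the "continuous block" idea concrete, here is a rough sketch in Python using the standard library's html.parser. The set of inline tags and the names ChunkSplitter and split_chunks are assumptions made up for illustration; nothing here is existing project code.

```python
from html.parser import HTMLParser

# Assumption: tags in this set are inline and do not break a text block.
INLINE_TAGS = {"span", "i", "b", "tt", "em", "strong", "a", "u", "font"}


class ChunkSplitter(HTMLParser):
    """Collect continuous text blocks, breaking only on non-inline tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # finished blocks of text
        self._parts = []   # text pieces of the block being built

    def _flush(self):
        # Join the pieces and normalise whitespace; skip empty blocks.
        text = " ".join("".join(self._parts).split())
        if text:
            self.chunks.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        self._parts.append(data)


def split_chunks(html):
    """Return the list of continuous text blocks found in `html`."""
    parser = ChunkSplitter()
    parser.feed(html)
    parser.close()     # makes the parser emit any buffered trailing text
    parser._flush()    # close the last open block
    return parser.chunks
```

Snippet generation would then look only at the entries of split_chunks(page), while indexing could still see the full content.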

Hmm... I'm not convinced. How would you generate the best snippet from a relevant, but ignored chunk?

Maybe eventually this could be the start of using tags to boost
certain sections of the page as Google probably does. Normal
text blocks would have a boost of 1.0, while stuff within <B> or <H*>
might be boosted to 1.5. Stuff within suspected navigation text
could be de-boosted to 0.25 or something. Maybe that would
be a more appropriate way of handling relevance of navigation
text. It should have some relevance, but not as much as content.
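
A minimal sketch of what I mean, in Python. It assumes the numbers above are multiplicative factors and that the page has already been split into sections labelled by kind; SECTION_BOOSTS and boosted_term_score are hypothetical names, and the values are just the guesses from this mail, not tuned.

```python
# Illustrative boost factors (assumed multiplicative, taken from the
# rough numbers floated in this thread).
SECTION_BOOSTS = {"normal": 1.0, "emphasis": 1.5, "navigation": 0.25}


def boosted_term_score(sections, term):
    """Sum occurrences of `term`, weighting each by its section's boost.

    `sections` is a list of (kind, text) pairs; unknown kinds get 1.0.
    """
    term = term.lower()
    score = 0.0
    for kind, text in sections:
        count = text.lower().split().count(term)
        score += SECTION_BOOSTS.get(kind, 1.0) * count
    return score
```

With this, a hit in navigation text still counts toward relevance, just at a quarter of the weight of a hit in ordinary content.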

Maybe the summary text could somehow ignore the de-boosted
sections to improve readability unless the content doesn't have
a better match. You basically construct a snippet giving preference
according to the boost value of the section of text.
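
Roughly, the selection could look like the Python sketch below. It assumes each section already carries its boost value; pick_snippet_section is a made-up name, and matching on whole lowercase words is a simplification.

```python
def pick_snippet_section(sections, query_terms):
    """Return the text of the best matching section, or None.

    `sections` is a list of (boost, text) pairs. Sections that match no
    query term are skipped; among the rest, a higher boost wins, and the
    number of distinct matched terms breaks ties.
    """
    terms = {t.lower() for t in query_terms}
    best = None
    for boost, text in sections:
        hits = len(terms & set(text.lower().split()))
        if hits == 0:
            continue
        key = (boost, hits)
        if best is None or key > best[0]:
            best = (key, text)
    return best[1] if best else None
```

A de-boosted section is only used for the summary when nothing with a higher boost matches the query.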

This all sounds like a lot of work though :)

Howie

