Richard Braman wrote:
Here is a potential algorithm:
Look first to the meta description. If none exists, look for a
continuous block of text, and ignore content that doesn't contain one.
If a given HTML tag contains only a few words of text, it is not
content, but rather part of the navigation structure of the page.
You may miss a lot of content this way: nowadays many pages freely mix
markup into the main content area...
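
A minimal sketch of the first algorithm, in Python rather than Nutch's
Java, only to make the idea concrete. The names (ContentExtractor,
extract_content), the list of block-level tags and the 10-word threshold
are all assumptions for illustration, not anything from Nutch. Inline
tags such as <a> or <b> do not split a block here, which is one way to
soften the mixed-markup problem:

```python
from html.parser import HTMLParser

# Assumption: these tags delimit blocks; inline tags (<a>, <b>, ...) do not.
BLOCK_TAGS = {"title", "p", "div", "td", "th", "tr", "table", "li", "ul",
              "ol", "h1", "h2", "h3", "h4", "h5", "h6", "br", "body"}
SKIP_TAGS = {"script", "style"}
MIN_WORDS = 10  # assumed cut-off for "only a few words"


class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.blocks = []        # finished text blocks, in document order
        self._buf = []          # text pieces of the block being built
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def flush(self):
        # Close the current block, collapsing runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = (attrs.get("content") or "").strip() or None
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)


def extract_content(html, min_words=MIN_WORDS):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    # Step 1: prefer the meta description when the page has one.
    if parser.meta_description:
        return parser.meta_description
    # Step 2: otherwise keep only blocks long enough to look like content.
    return "\n".join(b for b in parser.blocks if len(b.split()) >= min_words)
```

Short but legitimate content (captions, one-line summaries) would still
be dropped by the word threshold, so the cut-off would need tuning on
real pages.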
Here is yet another algorithm.
When fetching pages from a particular site, analyze the structure of
each page and try to determine what content stays similar from page to
page within that site. That would usually be menus, headers, footers,
etc.
This requires collecting pages in advance to train the structure
recognizer, and preparing "profiles" for groups of pages with common layout.
Granted the menus may change slightly from page to page, which is why
the algorithm would be pattern based instead of literal.
When you determine what is navigation and what is content, you would
only parse and index the content.
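
A rough sketch of the second algorithm, again in Python and again with
hypothetical names (normalize, build_profile, content_blocks). It
assumes each page has already been split into text blocks, and it
stands in for "pattern based" matching with a very crude normalization
(lowercase, digit runs replaced by '#'). The 80% threshold is likewise
an assumption:

```python
import re
from collections import Counter


def normalize(block):
    # Reduce a block to a pattern so slightly different menus or footers
    # (e.g. "Page 1 of 3" vs "Page 2 of 3") still compare equal.
    return re.sub(r"\d+", "#", " ".join(block.lower().split()))


def build_profile(pages, min_share=0.8):
    """pages: list of pages from one site, each a list of text blocks.
    Returns the normalized blocks found on at least min_share of the pages."""
    counts = Counter()
    for blocks in pages:
        counts.update({normalize(b) for b in blocks})  # count once per page
    threshold = min_share * len(pages)
    return {pattern for pattern, n in counts.items() if n >= threshold}


def content_blocks(blocks, profile):
    # Keep only the blocks that do not match the site's recurring patterns.
    return [b for b in blocks if normalize(b) not in profile]
```

For example, with three pages that share a menu, a pager and a footer:

```python
pages = [
    ["Home | About | Contact", "Article one about crawling.",
     "Page 1 of 3", "Copyright 2006 Example"],
    ["Home | About | Contact", "Article two about indexing.",
     "Page 2 of 3", "Copyright 2006 Example"],
    ["Home | About | Contact", "Article three about parsing.",
     "Page 3 of 3", "Copyright 2006 Example"],
]
profile = build_profile(pages)
print(content_blocks(pages[0], profile))
```

Only the article sentence survives for the first page. A real profile
would more likely compare DOM paths or tag structure than raw text.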
I think algorithm #1 is what Google uses.
Google ignores content that does not change from page to page, as well
as content that isn't part of a block of text.
Comments please
The best way to evaluate this would be to ..erhm.. evaluate these
algorithms on a set of reference pages. Would you like to implement one
or both algorithms and test them?
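
One possible way to score such a test, assuming each reference page
comes with hand-labelled content text: compare the extracted text with
the reference at the token level. The function name score and the
choice of bag-of-words precision/recall/F1 are assumptions, just one
simple metric among many:

```python
from collections import Counter


def score(extracted, reference):
    """Token-level precision, recall and F1 of extracted text against
    a hand-labelled reference text for the same page."""
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((ext & ref).values())  # tokens present in both
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Low precision would mean navigation text leaking into the index; low
recall would mean real content being thrown away.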
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers