Re: Template Detection?

Andrzej Bialecki Mon, 23 Mar 2009 09:35:12 -0700

dealmaker wrote:

Is there any substitute for Template Detection?  Any easy hack,
ready-made plugin or open source project that can improve the search
results to a certain degree without template detection?


A simple method is to do the following:

* prepare a simplified DOM tree where you remove all styling information, i.e. leave just the structural tags, plain text, and <a> tags.

* for each block of text inside a structural tag, count its outlinks (i.e. <a> tags) and measure its size (in characters or, better yet, in words).

* then go through the list of all blocks, and if a block's size is smaller than a threshold (relative or absolute), AND it contains a relatively high number of outlinks, discard it.

The remaining blocks can be merged together to form a cleaned-up body of the page.

This method works quite well for a broad range of pages, but it fails miserably for portal-type pages that consist of many "pagelets".
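
The steps above could be sketched roughly as follows. This is only an illustration in Python using the standard library's html.parser, not code from Nutch or any existing plugin. The tag sets, the names BlockCollector and clean_body, and both threshold values (max_words, min_link_density) are assumptions picked for the example and would need tuning on real pages.

```python
from html.parser import HTMLParser

# Assumed set of "structural" tags that delimit text blocks.
STRUCTURAL = {"p", "div", "td", "th", "li", "ul", "ol", "table", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"}
# Content of these tags is never treated as page text.
SKIP = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects (text, outlink_count) pairs, one per structural block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = []
        self._links = 0
        self._skip_depth = 0

    def _flush(self):
        # Close the current block; blocks without any text are dropped.
        if self._words:
            self.blocks.append((" ".join(self._words), self._links))
        self._words, self._links = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in STRUCTURAL:
            self._flush()
        elif tag == "a":
            self._links += 1

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in STRUCTURAL:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._words.extend(data.split())

    def close(self):
        super().close()
        self._flush()


def clean_body(html, max_words=20, min_link_density=0.25):
    """Drop blocks that are both small and link-heavy; merge the rest."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, links in collector.blocks:
        size = len(text.split())  # block size measured in words
        if size < max_words and links / size >= min_link_density:
            continue  # small AND relatively many outlinks: likely template
        kept.append(text)
    return "\n".join(kept)
```

With these assumed thresholds, a navigation list whose items are one or two words wrapped in a link is discarded, while a long paragraph containing a single link is kept. A short block without links (a heading, for example) is also kept, because both conditions must hold before a block is dropped.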


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
