dealmaker wrote:
Is there any substitute for template detection?  Any easy hack,
ready-made plugin, or open source project that can improve the search
results to some degree without template detection?

A simple method is to do the following:

* prepare a simplified DOM tree with all styling information removed, i.e. keep only the structural tags, the plain text, and the <a> tags.

* for each block of text inside a structural tag, count its outlinks (i.e. <a> tags) and measure its size (in characters or, better yet, in words).

* then go through the list of all blocks, and if a block's size is smaller than a threshold (relative or absolute) AND it contains a relatively high number of outlinks, discard it.

The remaining blocks can be merged together to form a cleaned-up body of the page.
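
The steps above could be sketched roughly as follows, using only Python's standard html.parser module. This is an illustration, not a tested implementation: the set of "structural" tags, the names BlockCollector and clean_body, and both thresholds (MIN_WORDS, MAX_LINKS_PER_WORD) are assumptions made for the example and would need tuning on real pages.

```python
from html.parser import HTMLParser

# Assumptions for this sketch (not from the original post): which tags
# count as structural, and the two threshold values.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "ol", "table", "tr", "body",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}
SKIP_TAGS = {"script", "style"}
MIN_WORDS = 10             # blocks shorter than this are "small"
MAX_LINKS_PER_WORD = 0.2   # above this ratio a block is "link-heavy"


class BlockCollector(HTMLParser):
    """Splits a page into text blocks at structural tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # (text, word_count, link_count) per block
        self._text = []
        self._links = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def flush(self):
        # Close the current block and normalize its whitespace.
        text = " ".join("".join(self._text).split())
        if text or self._links:
            self.blocks.append((text, len(text.split()), self._links))
        self._text = []
        self._links = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._links += 1
        # Any other tag (styling etc.) is ignored; its text is still kept.

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._text.append(data)


def clean_body(html):
    """Drop small, link-heavy blocks and merge the remaining ones."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text after the last tag
    kept = []
    for text, words, links in parser.blocks:
        small = words < MIN_WORDS
        link_heavy = links / max(words, 1) > MAX_LINKS_PER_WORD
        if small and link_heavy:
            continue
        if text:
            kept.append(text)
    return "\n".join(kept)
```

For example, given a page with a three-link navigation <div> followed by a paragraph of ordinary text, clean_body() would drop the navigation block and keep the paragraph. A short block with no links (such as a heading) is kept, because both conditions have to hold before a block is discarded.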

This method works quite well for a broad range of pages, but it fails miserably for portal-type pages that consist of many "pagelets".

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
