dealmaker wrote:
Is there any substitute for template detection? Any easy hack,
ready-made plugin, or open source project that can improve the search
results to a certain degree without template detection?
A simple method is to do the following:
* prepare a simplified DOM tree where you remove all styling information
- i.e. leave just the structural tags and plain text, and <a> tags.
* for each block of text inside a structural tag count its outlinks
(i.e. <a> tags) and measure its size (in characters or better yet in words).
* then go through the list of all blocks, and if a block's size is below
a threshold (relative or absolute) AND it contains a relatively high
number of outlinks, discard it.
The remaining blocks can be merged together to form a cleaned-up body of
the page.
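
The steps above could be sketched roughly as follows in Python, using only
the standard library's html.parser. This is an illustration under several
assumptions, not a tested production filter: the set of "structural" tags,
the names BlockCollector and clean_body, and the thresholds min_words=10
and max_link_density=0.2 are all made up for the example and would need
tuning on real pages.

```python
from html.parser import HTMLParser

# Assumption: which tags count as "structural" block boundaries.
BLOCK_TAGS = {"p", "div", "td", "th", "li", "ul", "ol", "table", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre",
              "section", "article", "nav", "header", "footer", "aside",
              "body"}
# Content of these tags is never visible text, so it is ignored.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the <a> tags in each."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished (text, link_count) pairs
        self._words = []   # words of the block being collected
        self._links = 0    # <a> tags seen in the current block
        self._skip = 0     # nesting depth inside script/style

    def _flush(self):
        # Close the current block; blocks without any text are dropped.
        if self._words:
            self.blocks.append((" ".join(self._words), self._links))
        self._words, self._links = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._links += 1
        # All other tags (styling etc.) are ignored; their text is kept.

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._words.extend(data.split())

    def close(self):
        super().close()
        self._flush()


def clean_body(html, min_words=10, max_link_density=0.2):
    """Drop blocks that are both small and link-heavy; merge the rest."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, links in collector.blocks:
        n_words = len(text.split())          # always >= 1 here
        too_small = n_words < min_words      # absolute size threshold
        link_heavy = links / n_words > max_link_density
        if too_small and link_heavy:
            continue
        kept.append(text)
    return "\n".join(kept)
```

Calling clean_body(page_html) returns the surviving blocks joined by
newlines. Size is measured in words here, and "relatively high number of
outlinks" is interpreted as links per word; both are just one possible
reading of the description.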
This method works quite well for a broad range of pages, but it fails
miserably for portal-type pages that consist of many "pagelets".
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com