Re: Template Detection?

2009-03-23 Thread Andrzej Bialecki

dealmaker wrote:

Hi,
  Does Nutch or any of its plugins support template detection?  It seems that
navigation and footer sections usually distort the ranking of search
results.  Is there an existing open source project or code that I can integrate
into Nutch to give it template detection?
Thanks.


There is no ready-made component in Nutch for this task. The task itself 
is complicated and there are no ideal solutions. There are several 
algorithms described in the literature, primarily falling into two 
groups: page-at-a-time (usually single pass) and whole-corpus (usually 
several passes). They work with varying degrees of success, strongly 
dependent on the test corpus.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Template Detection?

2009-03-23 Thread dealmaker

Is there any substitute for template detection?  Any easy hack,
ready-made plugin, or open source project that can improve the search
results to some degree without template detection?
Thanks.






Re: Template Detection?

2009-03-23 Thread Andrzej Bialecki

dealmaker wrote:

Is there any substitute for template detection?  Any easy hack,
ready-made plugin, or open source project that can improve the search
results to some degree without template detection?


A simple method is to do the following:

* prepare a simplified DOM tree from which all styling information is
removed, i.e. leave just the structural tags, the plain text, and the <a> tags.


* for each block of text inside a structural tag, count its outlinks
(i.e. <a> tags) and measure its size (in characters or, better yet, in words).


* then go through the list of all blocks, and if a block is smaller
than a threshold (relative or absolute) AND it contains a relatively high
number of outlinks, discard it.


The remaining blocks can be merged together to form a cleaned-up body of
the page.


This method works quite well for a broad range of pages, but it fails
miserably for portal-type pages that consist of many pagelets.
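

The steps described above could be sketched roughly as follows. This is
only an illustration in Python using the standard-library html.parser,
not Nutch code and not an existing plugin. The set of "structural" tags,
the names BlockCollector and clean_body, and both threshold values are
assumptions made for the example; they would need tuning on a real corpus.
Blocks are also treated as flat runs of text between structural tag
boundaries rather than as a real DOM tree.

```python
from html.parser import HTMLParser

# Assumption: which tags count as "structural" is a choice made for this sketch.
STRUCTURAL = {"body", "div", "p", "table", "tr", "td", "th", "ul", "ol", "li",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"}
SKIP = {"script", "style"}  # content that is never page text


class BlockCollector(HTMLParser):
    """Split a page into text blocks delimited by structural tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished (text, link_count) pairs
        self._text = []       # text pieces of the current block
        self._links = 0       # <a> tags seen in the current block
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Close the current block; empty blocks are dropped.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((text, self._links))
        self._text, self._links = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in STRUCTURAL:
            self._flush()
        elif tag == "a":
            self._links += 1

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in STRUCTURAL:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._text.append(data)

    def close(self):
        super().close()
        self._flush()


def clean_body(html, min_words=10, max_links_per_word=0.2):
    """Drop blocks that are both small and link-heavy; merge the rest.

    Both thresholds are hypothetical defaults, not values from the thread.
    """
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, links in parser.blocks:
        words = len(text.split())  # always >= 1, empty blocks were dropped
        if words < min_words and links / words > max_links_per_word:
            continue  # small AND relatively many outlinks: likely template
        kept.append(text)
    return "\n".join(kept)


page = """
<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
<div><p>Nutch is an open source web crawler that builds on Lucene and Hadoop
to fetch, parse and index pages at scale.</p></div>
<div><a href="/privacy">Privacy</a> | <a href="/terms">Terms</a></div>
</body></html>
"""
print(clean_body(page))
```

In this sample the navigation list and the footer links are short and
consist almost entirely of anchors, so they are discarded, while the
longer paragraph survives. A portal-type page made of many short,
link-rich pagelets would lose most of its content under the same rule,
which matches the failure case mentioned above.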

