Re: Template Detection?
dealmaker wrote:
> Hi, does Nutch or any of its plugins have template detection? It seems that navigation and footer sections usually distort the ranking of search results. Is there already an open source project or code that I can integrate into Nutch to give it the ability to detect templates? Thanks.

There is no ready-made component in Nutch for this task. The task itself is complicated and there are no ideal solutions. There are several algorithms described in the literature, primarily falling into two groups: page-at-a-time (usually single pass) and whole-corpus (usually several passes). They work with varying degrees of success, strongly dependent on the test corpus.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Template Detection?
Is there any substitute for template detection? Is there an easy hack, a ready-made plugin, or an open source project that can improve the search results to some degree without template detection? Thanks.

-- 
View this message in context: http://www.nabble.com/Template-Detection--tp22655736p22661543.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Template Detection?
dealmaker wrote:
> Is there any substitute for template detection? Is there an easy hack, a ready-made plugin, or an open source project that can improve the search results to some degree without template detection?

A simple method is the following:

* Prepare a simplified DOM tree from which all styling information has been removed, i.e. leave just the structural tags, the plain text, and the a tags.
* For each block of text inside a structural tag, count its outlinks (i.e. a tags) and measure its size (in characters or, better yet, in words).
* Go through the list of all blocks and discard any block whose size is below a threshold (relative or absolute) AND which contains a relatively high number of outlinks.

The remaining blocks can be merged together to form a cleaned-up body of the page. This method works quite well for a broad range of pages, but it fails miserably for portal-type pages that consist of many pagelets.

-- 
Best regards,
Andrzej Bialecki
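The block-filtering heuristic from the last reply can be sketched roughly as follows. This is only an illustration in standalone Python, not Nutch code and not the author's implementation. The names `BLOCK_TAGS`, `BlockCollector`, `clean_text`, and the threshold values `min_words` and `max_links_per_word` are assumptions made for the example. Instead of building a real DOM tree, the sketch uses a streaming parser and starts a new block at every structural tag, so text sitting directly inside nested containers is split into separate blocks.

```python
from html.parser import HTMLParser

# Assumed set of "structural" tags; the original post does not enumerate them.
BLOCK_TAGS = {
    "div", "p", "td", "th", "tr", "table", "li", "ul", "ol", "section",
    "article", "nav", "header", "footer", "blockquote", "pre",
    "h1", "h2", "h3", "h4", "h5", "h6",
}
# Content of these tags is never treated as page text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects (text, outlink_count) pairs, one per block of text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._words = []
        self._links = 0
        self._skip_depth = 0

    def flush(self):
        # Close the current block; blocks without any text are dropped.
        if self._words:
            self.blocks.append((" ".join(self._words), self._links))
        self._words = []
        self._links = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._links += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._words.extend(data.split())


def clean_text(html, min_words=10, max_links_per_word=0.2):
    """Drop blocks that are both short and link-heavy; join the rest.

    Both thresholds are illustrative guesses and would need tuning
    against a real corpus.
    """
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()

    kept = []
    for text, links in collector.blocks:
        n_words = len(text.split())  # always >= 1, see flush()
        is_small = n_words < min_words
        is_link_heavy = links / n_words > max_links_per_word
        if is_small and is_link_heavy:
            continue
        kept.append(text)
    return "\n".join(kept)


if __name__ == "__main__":
    sample = """
    <ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
    <div><p>Template detection tries to separate the main content of a page
    from navigation, footers and other repeated page furniture.</p></div>
    <div><a href="/privacy">Privacy</a> <a href="/terms">Terms</a></div>
    """
    print(clean_text(sample))
```

Measuring size in words and expressing the outlink count as links per word is one possible reading of "relatively high number of outlinks"; an absolute link count or a character-based size would fit the description equally well. As the reply notes, a filter like this can be expected to do poorly on portal-type pages, where most blocks are short and full of links.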