Re: How to ignore search results that don't have related keywords in main body?

Andrzej Bialecki Sat, 10 Oct 2009 09:22:26 -0700

BELLINI ADAM wrote:

hi guys... it's just what I'm talking about in my post 'indexing
just certain content'... you can read it, maybe it could help you... I
was asking how to get rid of the garbage sections in a document and
to parse only the important data... so I guess you will create your
own parser and indexer... but the problem is how could we delete those
garbage sections from an HTML page... try to read my post... maybe we can
merge our two posts... I don't know if we can merge posts on this
mailing list... to keep track of only one post...

What is garbage? Can you define it in terms of a regex pattern or an XPath expression that points to specific elements in the DOM tree? If you crawl a single site (or a few sites) with well-defined templates, then you can hardcode some rules for removing unwanted parts of the page.
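As a rough illustration of such hardcoded rules, here is a minimal sketch in Python (not Nutch code) that drops every element whose id or class matches a fixed list. The ids and class names (nav, footer, sidebar, menu, ads) and the names TemplateStripper and strip_template are made up for the example; you would replace the rules with whatever your site's template actually uses. It assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Hypothetical rules for one site's template; adjust to the real markup.
UNWANTED_IDS = {"nav", "footer", "sidebar"}
UNWANTED_CLASSES = {"menu", "ads"}
# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TemplateStripper(HTMLParser):
    """Collects page text, skipping subtrees that match the unwanted rules."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an unwanted subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside an unwanted subtree
            return
        attr = dict(attrs)
        classes = set((attr.get("class") or "").split())
        if attr.get("id") in UNWANTED_IDS or classes & UNWANTED_CLASSES:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_template(html):
    """Return the page text with the unwanted template parts removed."""
    parser = TemplateStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Pages with unclosed tags inside the unwanted sections would confuse the depth counter, so a real implementation would more likely work on a cleaned-up DOM tree.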

If you can't do this, then there are some heuristic methods to solve this. There are two groups of methods:

* page at a time (local): this group of methods considers only the current page that you analyze. The quality of filtering is usually limited.

* groups of pages (e.g. per site): these methods consider many pages at a time, and try to find recurring themes among them. Since you first need to accumulate some pages it can't be done on the fly, i.e. this requires a separate post-processing step.
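To give an idea of the second group, here is one very simple way to look for recurring content, sketched in Python: treat a block of text as template garbage if it shows up on more than a given share of the pages from the same site. The function name remove_recurring_blocks and the 0.5 threshold are assumptions for the example, and the pages are assumed to be already split into text blocks.

```python
from collections import Counter


def remove_recurring_blocks(pages, max_share=0.5):
    """pages: list of lists of text blocks, one list per page of the same site.

    Returns the same structure without blocks that occur on more than
    max_share of the pages.
    """
    seen_on = Counter()
    for blocks in pages:
        for block in set(blocks):  # count each block once per page
            seen_on[block] += 1
    limit = max_share * len(pages)
    return [[b for b in blocks if seen_on[b] <= limit] for blocks in pages]
```

This needs all pages of a site to be available first, which is why it only works as a post-processing step.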

The easiest to implement in Nutch is the first approach (page at a time). There are many possible implementations - e.g. based on text patterns, on visual position of elements, on DOM tree patterns, on "block of content" characteristics, etc.


Here's for example a simple method:

* collect text from the page in blocks, where each block fits within structural tags (div and table tags). Collect also the number of <a> links in each block.

* remove a percentage of the smallest blocks, where link number is high - these are likely navigational elements.

* reconstruct the whole page from the remaining blocks.
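
The three steps above could be sketched roughly like this in Python (again not Nutch code, just to show the idea). The names BlockCollector and extract_main_text, and the parameters drop_fraction and min_links with their default values, are assumptions for the example: "a percentage of the smallest blocks" is taken as the shortest drop_fraction of all blocks, and "link number is high" as at least min_links links.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}


class BlockCollector(HTMLParser):
    """Splits page text into blocks at div/table boundaries, counting <a> links."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished blocks as (text, link_count)
        self._text = []
        self._links = 0

    def _flush(self):
        text = " ".join(self._text).strip()
        if text:
            self.blocks.append((text, self._links))
        self._text, self._links = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()  # a structural tag starts a new block
        elif tag == "a":
            self._links += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self._text.append(data.strip())

    def close(self):
        super().close()
        self._flush()  # keep any trailing text outside div/table


def extract_main_text(html, drop_fraction=0.5, min_links=2):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    blocks = collector.blocks

    # Candidates for removal: the smallest blocks by text length.
    n_drop = int(len(blocks) * drop_fraction)
    smallest = sorted(range(len(blocks)), key=lambda i: len(blocks[i][0]))[:n_drop]
    # Of those, drop only the blocks with a high link count.
    dropped = {i for i in smallest if blocks[i][1] >= min_links}

    # Reconstruct the page text from the remaining blocks, in original order.
    return "\n".join(text for i, (text, _) in enumerate(blocks) if i not in dropped)
```

Both thresholds would need tuning on real pages; a short block with no links (e.g. a one-line paragraph) is kept on purpose.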

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
