BELLINI ADAM wrote:
Hi guys... this is just what I'm talking about in my post 'indexing
just certain content'. You can read it, maybe it could help you. I
was asking how to get rid of the garbage sections in a document and
parse only the important data. So I guess you will create your
own parser and indexer, but the problem is how we could delete those
garbage sections from an HTML page. Try to read my post; maybe we can
merge our two posts. I don't know if we can merge posts on this
mailing list, to keep track of only one post.

What is garbage? Can you define it in terms of a regex pattern or an XPath expression that points to specific elements in the DOM tree? If you crawl a single site (or a few sites) with well-defined templates, then you can hardcode some rules for removing unwanted parts of the page.
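As a rough sketch of such hardcoded rules, the Python snippet below drops elements matched by a list of XPath expressions and returns the remaining text. The rule list (`div` elements with `id='nav'` or `class='footer'`) and the function names are made up for illustration, and the code assumes well-formed XHTML input; real-world pages would first need a tolerant HTML parser. It is not Nutch code, just an illustration of the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for one site template. Each entry is an XPath
# expression (ElementTree's limited subset) matching elements to drop.
GARBAGE_XPATHS = [
    ".//div[@id='nav']",
    ".//div[@class='footer']",
]


def remove_keep_tail(parent, element):
    """Remove element from parent but keep the text that follows it."""
    siblings = list(parent)
    if element not in siblings:
        return  # already removed by an earlier rule
    tail = element.tail or ""
    index = siblings.index(element)
    if index > 0:
        previous = siblings[index - 1]
        previous.tail = (previous.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(element)


def strip_garbage(xhtml, xpaths=GARBAGE_XPATHS):
    """Return the page text with all elements matching xpaths removed."""
    root = ET.fromstring(xhtml)  # assumes well-formed XHTML
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for xpath in xpaths:
        for element in root.findall(xpath):
            parent = parent_of.get(element)
            if parent is not None:
                remove_keep_tail(parent, element)
    # Collapse whitespace in the remaining text.
    return " ".join("".join(root.itertext()).split())
```

Calling `strip_garbage(page)` on a page from the templated site returns only the text outside the listed elements. The rules have to be maintained by hand whenever the site template changes.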

If you can't do this, then there are some heuristic methods for this problem. They fall into two groups:

* page at a time (local): this group of methods considers only the current page that you analyze. The quality of filtering is usually limited.

* groups of pages (e.g. per site): these methods consider many pages at a time, and try to find recurring themes among them. Since you first need to accumulate some pages, this can't be done on the fly, i.e. it requires a separate post-processing step.
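To make the second group more concrete, here is a minimal sketch of one possible per-site heuristic: count in how many pages each text block occurs, and drop blocks that recur in more than a given share of the pages. The input layout (a dict mapping URL to a list of text blocks), the function name, and the `max_share` threshold are assumptions made for this example; how the pages are split into blocks is left open.

```python
from collections import Counter


def remove_recurring_blocks(pages, max_share=0.5):
    """Drop text blocks that occur in more than max_share of the pages.

    pages: dict mapping a URL to the list of text blocks on that page
           (an assumed input format for this sketch).
    """
    # Document frequency: in how many pages does each block appear?
    doc_freq = Counter()
    for blocks in pages.values():
        doc_freq.update(set(blocks))

    limit = max_share * len(pages)
    return {
        url: [block for block in blocks if doc_freq[block] <= limit]
        for url, blocks in pages.items()
    }
```

Because the counts are only meaningful once enough pages from the site have been collected, this has to run as a post-processing step over the accumulated pages.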

The easiest to implement in Nutch is the first approach (page at a time). There are many possible implementations - e.g. based on text patterns, on visual position of elements, on DOM tree patterns, on "block of content" characteristics, etc.

Here's a simple method, for example:

* collect text from the page in blocks, where each block fits within structural tags (div and table tags). Also collect the number of <a> links in each block.

* remove a percentage of the smallest blocks, where the number of links is high - these are likely navigational elements.

* reconstruct the whole page from the remaining blocks.
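The three steps above could be sketched roughly as follows, using Python's standard `html.parser`. This is only one reading of the method: the class and function names, the `fraction` of smallest blocks considered, and the `min_links` cutoff for "high" are all assumptions picked for the example, and script/style content is not treated specially.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}  # structural tags that delimit blocks


class BlockCollector(HTMLParser):
    """Collects (text, link_count) pairs for each block of the page."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._links = 0

    def _flush(self):
        # Close the current block; blocks without any text are discarded.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._links))
        self._text, self._links = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._links += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._text.append(data)

    def close(self):
        super().close()
        self._flush()


def filter_page(html, fraction=0.3, min_links=2):
    """Rebuild the page text without small, link-heavy blocks."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    blocks = collector.blocks

    # Take the given fraction of the smallest blocks (by text length) ...
    n_small = int(len(blocks) * fraction)
    by_size = sorted(range(len(blocks)), key=lambda i: len(blocks[i][0]))
    # ... and drop those among them that contain many links.
    dropped = {i for i in by_size[:n_small] if blocks[i][1] >= min_links}

    return "\n".join(
        text for i, (text, _) in enumerate(blocks) if i not in dropped
    )
```

Small blocks with few or no links are kept, so short but real content such as a caption survives; both thresholds would need tuning on real pages.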

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
