Re: extracting displayed data of body tag in HTML documents

Gal Nitzan Sat, 02 Dec 2006 13:13:31 -0800

I can't go into much detail on this, but I can give you some pointers:


1. The place to do what you want is a parse filter.
2. You could replace the HtmlParser and work directly with Neko (we use
TagSoup). However, you would then have to merge by hand any later changes the
committers make to the HtmlParser.
3. The other option is to write a parse filter that converts the content back
to a DOM and then uses a tool like Jelly to run XPath queries that search the
tree and extract the data.

The 3rd option is my favorite.
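
To give a rough idea of the XPath part of option 3, here is a minimal sketch
using only the XML APIs that ship with the JDK, not Jelly and not the Nutch
parse filter API. The class name BodyTextExtractor and the method
extractBodyText are made up for illustration. It assumes the input is
well-formed XML with no namespace and no DOCTYPE, like the two snippets in the
question; real-world HTML would first need a tolerant parser such as NekoHTML
or TagSoup to build the DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Nutch.
class BodyTextExtractor {

    // Returns the text found under <body>, or null if there is none.
    // Assumption: the input is well-formed XML (no namespace, no DOCTYPE).
    static String extractBodyText(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // normalize-space() takes the string value of <body> (the text of
            // all its descendants), trims it and collapses whitespace runs.
            String text = xpath.evaluate("normalize-space(/html/body)", doc);
            return text.isEmpty() ? null : text;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String case1 = "<html><body><a href=\"example.com\"> this is an example </a></body></html>";
        String case2 = "<html><body><a href=\"example.com\"> </a></body></html>";
        System.out.println(extractBodyText(case1)); // prints: this is an example
        System.out.println(extractBodyText(case2)); // prints: null
    }
}
```

Note that this takes every text node under <body>, so the contents of script
and style elements would be included too. A stricter XPath expression would be
needed to skip them.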

HTH.




------ Original Message ------
Received: Thu, 30 Nov 2006 06:07:04 PM IST
From: Murat Ali Bayir <[EMAIL PROTECTED]>
To: [email protected]
Subject: extracting displayed data of body tag in HTML documents

> Hi All, I want to ask a question about the NekoHTML parser that is used by
> Nutch. I want to know
> whether we can have a textExtraction function that extracts the displayed
> data in HTML documents
> between the <body> and </body> tags?
> 
> This textExtraction function can work like below:
> 
> case 1: Assume that our html document is given as:
> 
> <html>
> <body>
> 
> <a href="example.com"> this is an example </a>
> 
> </body>
> </html>
> 
> 
> For case 1, the textExtraction function returns the string
> "this is an example".
> 
> case 2: Assume that our html document is given as:
> 
> <html>
> <body>
> 
> <a href="example.com"> </a>
> 
> </body>
> </html>
> 
> 
> For case 2, the textExtraction function returns null.
> 
> Does anybody know how to do this using the NekoHTML parser?
> 
>
