I can't elaborate too much on this, but I can give you some pointers:
1. The place to do what you want is a parse filter.
2. You could replace the HtmlParser and work directly with neko (we use
   TagSoup). However, any change that committers later make to the
   HtmlParser would then have to be merged into your copy by hand.
3. The other option is to write a parse filter that converts the content
   back to a DOM and then uses a tool like Jelly to run XPath queries
   against the tree and extract the data.

The 3rd option is my favorite.

HTH.

------ Original Message ------
Received: Thu, 30 Nov 2006 06:07:04 PM IST
From: Murat Ali Bayir <[EMAIL PROTECTED]>
To: [email protected]
Subject: extracting displayed data of body tag in HTML documents

> Hi All, I want to ask a question about the NekoHTML parser that is used by
> Nutch. I want to know
> whether we can have a textExtraction function extracting the displayed data in
> HTML documents
> between the <body> and </body> tags?
>
> This textExtraction function can work like below:
>
> case 1: Assume that our html document is given as:
>
> <html>
> <body>
>
> <a href="example.com"> this is an example </a>
>
> </body>
> </html>
>
>
> the textExtraction function returns the string "this is an example" for
> case 1.
>
> <html>
> <body>
>
> <a href="example.com"> </a>
>
> </body>
> </html>
>
>
> in this case the textExtraction function returns null for case 2.
>
> Does anybody know how to perform that by using the NekoHTML parser?
>
>
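
To make option 3 in the reply a bit more concrete, here is a minimal sketch of the "DOM plus XPath" idea, using only the JDK's built-in XML and XPath classes. It is not Nutch code and not the project's implementation: the class name BodyTextExtractor and the method textExtraction (named after the function in the question) are hypothetical, and the input is assumed to be well-formed markup without a DOCTYPE, standing in for the DOM that a real parse filter would get from the HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or NekoHTML.
public class BodyTextExtractor {

    // Returns the visible text under <body>, or null if there is none.
    // Assumption: the input is well-formed, namespace-free markup; real-world
    // HTML would first have to go through a tolerant parser.
    public static String textExtraction(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));

        // Select every text node under <body>, skipping script and style
        // content, which is not displayed to the user.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList textNodes = (NodeList) xpath.evaluate(
                "/html/body//text()[not(ancestor::script or ancestor::style)]",
                doc, XPathConstants.NODESET);

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < textNodes.getLength(); i++) {
            sb.append(textNodes.item(i).getNodeValue()).append(' ');
        }

        // Collapse runs of whitespace and trim the ends.
        String text = sb.toString().replaceAll("\\s+", " ").trim();
        return text.isEmpty() ? null : text;
    }

    public static void main(String[] args) throws Exception {
        String case1 = "<html><body>"
                + "<a href=\"example.com\"> this is an example </a>"
                + "</body></html>";
        String case2 = "<html><body>"
                + "<a href=\"example.com\"> </a>"
                + "</body></html>";
        System.out.println(textExtraction(case1)); // prints "this is an example"
        System.out.println(textExtraction(case2)); // prints "null"
    }
}
```

Inside an actual parse filter the parsing step would presumably be unnecessary, since the filter would work on the DOM it is handed; only the XPath query and the whitespace handling would carry over.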
