Hi Murat, I think that happens already, though I might be wrong.
A quick scan led me to the class org.apache.nutch.parse.html.DOMContentUtils. Have a look around it, and also consider looking at HTMLParser.java in Nutch.

I haven't looked at the NekoHTML source closely, but I hope this helps.

Fadzi

On Thu, 2006-11-30 at 18:07 +0200, Murat Ali Bayir wrote:
> Hi All, I want to ask a question about the NekoHTML parser that is used by
> Nutch. I want to know whether we can have a textExtraction function that
> extracts the displayed data in HTML documents between the <body> and
> </body> tags.
>
> This textExtraction function can work like below:
>
> case 1: Assume that our html document is given as:
>
> <html>
> <body>
>
> <a href="example.com"> this is an example </a>
>
> </body>
> </html>
>
> The textExtraction function returns the string "this is an example" for
> case 1.
>
> <html>
> <body>
>
> <a href="example.com"> </a>
>
> </body>
> </html>
>
> In this case the textExtraction function returns null for case 2.
>
> Does anybody know how to perform that by using the NekoHTML parser?
>
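
For reference, here is a minimal sketch of the textExtraction behaviour described in the quoted message. It is not Nutch's or NekoHTML's own code. The class name TextExtraction and the helpers collectText and parseWellFormed are hypothetical names chosen for illustration. The sketch walks a standard org.w3c.dom.Document, and to stay self-contained it builds that Document with the JDK's XML parser, so it only handles well-formed markup such as the two cases above. For real-world HTML you would obtain the Document from an HTML parser such as NekoHTML instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TextExtraction {

    /** Returns the text found under the body element, or null if there is none. */
    public static String textExtraction(Document doc) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Node n = all.item(i);
            // Compare case-insensitively: HTML parsers may report BODY or body.
            if ("body".equalsIgnoreCase(n.getNodeName())) {
                StringBuilder sb = new StringBuilder();
                collectText(n, sb);
                // Trim the ends and collapse inner runs of whitespace.
                String text = sb.toString().trim().replaceAll("\\s+", " ");
                return text.isEmpty() ? null : text;
            }
        }
        return null; // no body element at all
    }

    /** Appends the value of every text node below the given node, in document order. */
    private static void collectText(Node node, StringBuilder sb) {
        String name = node.getNodeName();
        // Script and style content is not displayed, so skip those subtrees.
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, sb);
        }
    }

    /** Hypothetical helper (assumption): parses WELL-FORMED markup with the JDK XML parser. */
    public static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String case1 = "<html><body><a href=\"example.com\"> this is an example </a></body></html>";
        String case2 = "<html><body><a href=\"example.com\"> </a></body></html>";
        System.out.println(textExtraction(parseWellFormed(case1)));
        System.out.println(textExtraction(parseWellFormed(case2)));
    }
}
```

The traversal depends only on the W3C DOM interfaces, so it should work unchanged on a Document produced by another parser.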
