Re: extracting displayed data of body tag in HTML documents

Fadzi Ushewokunze Sat, 02 Dec 2006 17:51:06 -0800

Hi Murat,

I think that happens already, though I might be wrong.


A quick scan led me to this class
org.apache.nutch.parse.html.DOMContentUtils

Have a look around there, and also consider looking at HtmlParser.java in Nutch.
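
A rough sketch of the extraction described in the question, for illustration only. It is not Nutch's or NekoHTML's code: the names BodyTextSketch and collect are made up, and it uses the JDK's built-in XML parser as a stand-in, so it only handles well-formed input such as the two cases quoted below. NekoHTML's DOMParser also produces an org.w3c.dom.Document, so the same kind of walk should carry over; the tag checks ignore case because NekoHTML upper-cases element names by default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or NekoHTML.
final class BodyTextSketch {

    /** Returns the displayed text inside <body>, or null if there is none. */
    static String textExtraction(String xhtml) {
        Document doc;
        try {
            // Assumption: the input is well-formed XHTML. Real-world HTML
            // would need a forgiving parser such as NekoHTML instead.
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), false, out);
        // Collapse runs of whitespace so layout-only text counts as empty.
        String text = out.toString().replaceAll("\\s+", " ").trim();
        return text.isEmpty() ? null : text;
    }

    /** Depth-first walk that appends text nodes found below <body>. */
    private static void collect(Node node, boolean inBody, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // A space after each text node keeps adjacent elements apart.
            if (inBody) out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            // Script and style content is never displayed, so skip it.
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) return;
            if (name.equalsIgnoreCase("body")) inBody = true;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, inBody, out);
        }
    }

    public static void main(String[] args) {
        String case1 = "<html><body>\n<a href=\"example.com\"> this is an example </a>\n</body></html>";
        String case2 = "<html><body>\n<a href=\"example.com\"> </a>\n</body></html>";
        System.out.println(textExtraction(case1)); // prints: this is an example
        System.out.println(textExtraction(case2)); // prints: null
    }
}
```

Returning null for whitespace-only content follows the behaviour asked for in case 2; returning an empty string instead would be an equally reasonable choice.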

I haven't looked at the NekoHTML source closely, but I hope this helps.

Fadzi




On Thu, 2006-11-30 at 18:07 +0200, Murat Ali Bayir wrote:
> Hi all, I want to ask a question about the NekoHTML parser that is used by
> Nutch. I want to know
> whether we can have a textExtraction function that extracts the displayed data in
> HTML documents
> between the <body> and </body> tags?
> 
> This textExtraction function could work as follows:
> 
> Case 1: Assume that our HTML document is given as:
> 
> <html>
> <body>
> 
> <a href="example.com"> this is an example </a>
> 
> </body>
> </html>
> 
> 
> For case 1, the textExtraction function returns the string "this is an example".
> 
> Case 2: Assume instead that the link contains no text:
> 
> <html>
> <body>
> 
> <a href="example.com"> </a>
> 
> </body>
> </html>
> 
> 
> For case 2, the textExtraction function returns null.
> 
> Does anybody know how to do this using the NekoHTML parser?
>
