[Nutch-general] extracting displayed data of body tag in HTML documents

Murat Ali Bayir Thu, 30 Nov 2006 08:07:11 -0800

Hi All, I have a question about the NekoHTML parser that is used by
Nutch. I want to know
whether we can have a textExtraction function that extracts the
displayed text of an HTML document,
i.e. the text between the <body> and </body> tags?


This textExtraction function could work as follows:

Case 1: Assume that our HTML document is given as:

<html>
<body>

<a href="example.com"> this is an example </a>

</body>
</html>


For case 1, the textExtraction function returns the string
"this is an example".

Case 2: Assume instead that the link contains no text:

<html>
<body>

<a href="example.com"> </a>

</body>
</html>


For case 2, the textExtraction function returns null.

Does anybody know how to do this with the NekoHTML parser?
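
To make the two cases concrete, here is a minimal sketch of such a function
in Java. It is not Nutch's or NekoHTML's own code. It only walks a standard
W3C DOM tree (org.w3c.dom), so it should apply to any parser that produces
one. The class name BodyText and its methods are hypothetical names chosen
for this illustration. As an assumption, with NekoHTML the Document would
come from its DOM parser (org.cyberneko.html.parsers.DOMParser, via parse and
getDocument); the sketch instead uses the JDK's XML parser, which works here
only because the two sample documents are well-formed. Element names are
compared case-insensitively because an HTML parser may report them in upper
case.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Nutch or NekoHTML.
final class BodyText {

    // Returns the whitespace-normalized text under <body>,
    // or null when there is no body or no visible text.
    static String extract(Document doc) {
        Node body = findElement(doc, "body");
        if (body == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        collectText(body, sb);
        String text = sb.toString().trim().replaceAll("\\s+", " ");
        return text.isEmpty() ? null : text;
    }

    // Depth-first search for the first element with the given name (any case).
    static Node findElement(Node node, String name) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && name.equalsIgnoreCase(node.getNodeName())) {
            return node;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            Node hit = findElement(c, name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    // Appends every text node below 'node', skipping script and style content.
    // A space is added after each text node so adjacent blocks do not run
    // together; this can also split a word that spans inline tags.
    static void collectText(Node node, StringBuilder sb) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            sb.append(node.getNodeValue()).append(' ');
            return;
        }
        if (type == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, sb);
        }
    }

    // Demo-only stand-in for the HTML parser: accepts well-formed markup only.
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        String case1 = "<html><body>\n<a href=\"example.com\"> this is an example </a>\n</body></html>";
        String case2 = "<html><body>\n<a href=\"example.com\"> </a>\n</body></html>";
        System.out.println(extract(parseWellFormed(case1)));
        System.out.println(extract(parseWellFormed(case2)));
    }
}
```

Under these assumptions, the first call yields "this is an example" and the
second yields null, matching case 1 and case 2 above. Real-world HTML is
rarely well-formed, which is why the tree would need to come from a tolerant
HTML parser rather than the XML parser used in this demo.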
