Create a new Nutch extension that adds a field to the Nutch document containing the text of all links found on a page. Take a look at the NekoHTML or HTMLParser documentation: use one of them to collect every link on the page and extract its anchor text, then add the result as a new field on the Nutch document.
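To illustrate the extraction step, here is a minimal, self-contained sketch. A real Nutch HtmlParseFilter would walk the DOM tree that NekoHTML already builds during parsing; the regex below is only a stand-in so the example runs with the plain JDK, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTextExtractor {
    // Matches <a ... href="...">...</a>; a stand-in for a real HTML parser.
    private static final Pattern LINK = Pattern.compile(
        "<a\\s+[^>]*href=[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collect the visible anchor text of every link on the page.
    public static List<String> extractLinkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // Strip any nested tags inside the anchor, keep the visible text.
            texts.add(m.group(2).replaceAll("<[^>]+>", "").trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://a.example/\">First link</a>"
                    + " and <a href='http://b.example/'><b>Second</b> link</a></p>";
        // Joining the texts gives the value for the new document field.
        System.out.println(String.join(" ", extractLinkTexts(html)));
    }
}
```

In the plugin, you would join these strings and call something like `doc.add("anchorText", joined)` in your IndexingFilter (field name is your choice).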
I had a similar requirement: get all image URLs from a page and add them as a new field in the Nutch document. I used HTMLParser to extract all the images, joined the URLs into comma-separated text, and added that as a new field in the index.

- RB

--- On Sun, 11/9/08, kevin pang <[EMAIL PROTECTED]> wrote:

> From: kevin pang <[EMAIL PROTECTED]>
> Subject: how to crawl all the urls in the page
> To: [email protected]
> Date: Sunday, November 9, 2008, 9:28 PM
> I want to crawl all the URLs in the page, including those that appear
> as plain text, not just as hyperlinks. How do I add this rule to the
> Nutch fetcher? Can anyone help? Much appreciated.
>
> Regards,
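The image-URL variant described above can be sketched the same way. This is a JDK-only approximation (regex instead of HTMLParser, hypothetical class name); it shows the "join the URLs into one comma-separated field value" step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageUrlExtractor {
    // Matches <img ... src="...">; a stand-in for HTMLParser's ImageTag scan.
    private static final Pattern IMG = Pattern.compile(
        "<img\\s+[^>]*src=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Return all image URLs joined as comma-separated text,
    // ready to be stored as a single indexed field.
    public static String imageUrlsAsField(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return String.join(",", urls);
    }

    public static void main(String[] args) {
        String html = "<img src=\"/a.png\"><p>text</p><img src='/b.jpg' alt=''>";
        System.out.println(imageUrlsAsField(html)); // prints "/a.png,/b.jpg"
    }
}
```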
