Create a new Nutch extension that adds a field to the Nutch document containing the text of all links found on a page. Take a look at the NekoHTML or HTMLParser documentation: use one of them to collect every link on the page and extract its anchor text, then add the result as a new field on the Nutch document.
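To illustrate the extraction step, here is a minimal, self-contained sketch. A real Nutch HtmlParseFilter would walk the DOM tree that NekoHTML already builds during parsing; the regex below is only a stand-in so the example runs with the plain JDK, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTextExtractor {
    // Matches <a ... href="...">...</a>; a stand-in for a real HTML parser.
    private static final Pattern LINK = Pattern.compile(
        "<a\\s+[^>]*href=[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collect the visible anchor text of every link on the page.
    public static List<String> extractLinkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // Strip any nested tags inside the anchor, keep the visible text.
            texts.add(m.group(2).replaceAll("<[^>]+>", "").trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://a.example/\">First link</a>"
                    + " and <a href='http://b.example/'><b>Second</b> link</a></p>";
        // Joining the texts gives the value for the new document field.
        System.out.println(String.join(" ", extractLinkTexts(html)));
    }
}
```

In the plugin, you would join these strings and call something like `doc.add("anchorText", joined)` in your IndexingFilter (field name is your choice).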
I had a similar requirement: get all image URLs from a page and add them as a new field in the Nutch document. I used HTMLParser to extract all the images, joined the URLs into comma-separated text, and added that as a new field in the index.

- RB

--- On Sun, 11/9/08, kevin pang <[EMAIL PROTECTED]> wrote:

> From: kevin pang <[EMAIL PROTECTED]>
> Subject: how to crawl all the urls in the page
> To: [email protected]
> Date: Sunday, November 9, 2008, 9:28 PM
> I want to crawl all the URLs in the page, including those that appear
> as plain text, not just as hyperlinks. How do I add this rule to the
> Nutch fetcher? Can anyone help? Much appreciated.
>
> Regards,
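The image-URL variant described above can be sketched the same way. This is a JDK-only approximation (regex instead of HTMLParser, hypothetical class name); it shows the "join the URLs into one comma-separated field value" step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageUrlExtractor {
    // Matches <img ... src="...">; a stand-in for HTMLParser's ImageTag scan.
    private static final Pattern IMG = Pattern.compile(
        "<img\\s+[^>]*src=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Return all image URLs joined as comma-separated text,
    // ready to be stored as a single indexed field.
    public static String imageUrlsAsField(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return String.join(",", urls);
    }

    public static void main(String[] args) {
        String html = "<img src=\"/a.png\"><p>text</p><img src='/b.jpg' alt=''>";
        System.out.println(imageUrlsAsField(html)); // prints "/a.png,/b.jpg"
    }
}
```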
