Re: how to crawl all the urls in the page

kevin pang Tue, 11 Nov 2008 23:24:11 -0800

Cool,

Thanks for your reply.
Which extension point should this extension implement?
Thanks in advance.


2008/11/10 Cool The Breezer <[EMAIL PROTECTED]>

> Create a new Nutch extension that adds a field to the document containing
> the text of all links found in a page. Take a look at the NekoHTML or
> HTMLParser documentation to get all the links on a page and extract the
> text of each one. Then add the result as a new field in the Nutch document.
>
> I had a similar requirement: get all image URLs from a page and add them
> as a new field in the Nutch document. I used htmlparser to extract all the
> images, converted the URLs to comma-separated text, and added that as a
> new field in the index.
>
> - RB
>
>
> --- On Sun, 11/9/08, kevin pang <[EMAIL PROTECTED]> wrote:
>
> > From: kevin pang <[EMAIL PROTECTED]>
> > Subject: how to crawl all the urls in the page
> > To: [email protected]
> > Date: Sunday, November 9, 2008, 9:28 PM
> > I want to crawl all the URLs in a page, including those
> > displayed as plain text, not
> > just as hyperlinks. How do I add this rule to the Nutch fetcher?
> > Can anyone help? Much appreciated.
> >
> > Regards,
>
>
>
>
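
The approach quoted above (collect the URLs from a page, then join them into comma-separated text for a new field) can be sketched in plain Java. This is only an illustration under stated assumptions: the class name TextUrlExtractor, both method names, and the regular expression are hypothetical, and no Nutch API is used. Wiring the result into a Nutch plugin is left out because that depends on the extension point and the Nutch version.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: finds URLs that appear anywhere
// in the raw page content, whether inside an href or as plain text.
final class TextUrlExtractor {

    // Deliberately simple assumption: a URL starts with http:// or https://
    // and runs until whitespace, a quote, or an angle bracket.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    // Returns the distinct URLs in the order they first appear.
    static Set<String> extractUrls(String content) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher matcher = URL_PATTERN.matcher(content);
        while (matcher.find()) {
            // Drop sentence punctuation that happens to follow a URL in running text.
            String url = matcher.group().replaceAll("[.,;:!?)]+$", "");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Joins the URLs into comma-separated text, as described in the quoted reply.
    static String toFieldValue(Set<String> urls) {
        return String.join(",", urls);
    }

    public static void main(String[] args) {
        String page = "<p>See http://example.com/a, or "
                + "<a href=\"http://example.com/b\">this</a>.</p>";
        // prints "http://example.com/a,http://example.com/b"
        System.out.println(toFieldValue(extractUrls(page)));
    }
}
```

Note that a comma is itself a legal character inside a URL, so a comma-separated field can be ambiguous; storing one field value per URL would avoid that if the index allows it.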
