I think this page will help you a lot:

http://wiki.apache.org/nutch/WritingPluginExample-0.9
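
If I remember right, the example on that page hooks into two extension points:
org.apache.nutch.parse.HtmlParseFilter (to pull data out of the page while it
is parsed) and org.apache.nutch.indexer.IndexingFilter (to add that data as a
field in the index). A plugin.xml for a link-text plugin could look roughly
like the sketch below; the ids, names and class names are only placeholders,
the layout follows the wiki example:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="linktext" name="Link Text Plugin"
        version="0.0.1" provider-name="example.org">

   <runtime>
      <!-- the jar built from the plugin sources -->
      <library name="linktext.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <!-- collects the link texts while the page is parsed -->
   <extension id="org.example.linktext.parsefilter"
              name="Link Text Parse Filter"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="LinkTextParseFilter"
                      class="org.example.linktext.LinkTextParseFilter"/>
   </extension>

   <!-- adds the collected texts as a new field at indexing time -->
   <extension id="org.example.linktext.indexingfilter"
              name="Link Text Indexing Filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="LinkTextIndexingFilter"
                      class="org.example.linktext.LinkTextIndexingFilter"/>
   </extension>
</plugin>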


Alex
2008/11/12 kevin pang <[EMAIL PROTECTED]>

> Cool,
>
> Thanks for your reply.
> I would like to know which extension point this extension should extend.
> Thanks in advance.
>
> 2008/11/10 Cool The Breezer <[EMAIL PROTECTED]>
>
> > Create a new Nutch extension that adds a new field to the document,
> > containing the text of all links available in a page. Take a look at the
> > NekoHTML or HTMLParser documentation, extract the links of the page along
> > with their texts, and then add them as a new field to the Nutch document.
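
The parse-filter side of that mostly comes down to walking the DOM fragment
the HTML parse filter is given and collecting the text under every <a>
element. A rough, generic sketch of just that step (the class and method
names are made up for illustration; the wiring into the plugin itself follows
the wiki example):

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class LinkTextCollector {

  /** Recursively collects the text of every <a> element under the node. */
  public static void collectAnchorText(Node node, StringBuffer out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      out.append(getText(node)).append(' ');
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectAnchorText(children.item(i), out);
    }
  }

  /** Concatenates all text nodes under the given node. */
  private static String getText(Node node) {
    StringBuffer sb = new StringBuffer();
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue());
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      sb.append(getText(children.item(i)));
    }
    return sb.toString();
  }
}

The collected string would then go into the parse metadata, and the indexing
filter would add it to the Lucene document as a new field, much as the wiki
example does for the meta tag it extracts.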
> >
> > I had the same kind of requirement: to get all image URLs from a page and
> > add them as a new field in the Nutch document. I used HTMLParser to extract
> > all the images, converted their URLs to comma-separated text, and added
> > them as a new field in the index.
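
For reference, the HTMLParser part of such an approach might look something
like the sketch below (the class and method names here are illustrative, not
taken from the actual plugin):

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class ImageUrlExtractor {

  /** Returns the src URLs of all <img> tags in the HTML, comma separated. */
  public static String extractImageUrls(String html) throws ParserException {
    Parser parser = Parser.createParser(html, "UTF-8");
    NodeList images =
        parser.extractAllNodesThatMatch(new NodeClassFilter(ImageTag.class));
    StringBuffer urls = new StringBuffer();
    for (int i = 0; i < images.size(); i++) {
      ImageTag image = (ImageTag) images.elementAt(i);
      if (urls.length() > 0) {
        urls.append(',');
      }
      urls.append(image.getImageURL());
    }
    return urls.toString();
  }
}

Swapping ImageTag for LinkTag (which has getLink() and getLinkText()) should
give the equivalent list of links and link texts for the original question.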
> >
> > - RB
> >
> >
> > --- On Sun, 11/9/08, kevin pang <[EMAIL PROTECTED]> wrote:
> >
> > > From: kevin pang <[EMAIL PROTECTED]>
> > > Subject: how to crawl all the urls in the page
> > > To: [email protected]
> > > Date: Sunday, November 9, 2008, 9:28 PM
> > > I want to crawl all the URLs in a page, including those that appear as
> > > plain text and not just as hyperlinks. How can I add this rule to the
> > > Nutch fetcher? Any help would be much appreciated.
> > >
> > > Regards,



-- 
Best Regards
Alexander Aristov
