Thanks again.
For the crawler to crawl the text-format urls, do I only need to add the
two classes, which put the parsed urls into the new index field? Is there
anything else to do to get the urls into the fetchlist?
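
For the fetchlist part, I guess the extracted urls also have to go into
the outlinks, so that updatedb puts them into the crawldb? Something like
this untested sketch (the helper and all the names in it are my guesses;
the exact constructors may differ in your Nutch version):

import java.net.MalformedURLException;
import java.util.List;

import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;

// Hypothetical helper for the HtmlParseFilter: take the Parse produced so
// far plus the url strings found in plain text, and return a Parse whose
// outlinks include them, so that updatedb can add them to the crawldb and
// they end up in later fetchlists.
public Parse addTextUrlsToOutlinks(Parse parse, List textUrls)
    throws MalformedURLException {
  Outlink[] existing = parse.getData().getOutlinks();
  Outlink[] all = new Outlink[existing.length + textUrls.size()];
  System.arraycopy(existing, 0, all, 0, existing.length);
  for (int i = 0; i < textUrls.size(); i++) {
    // Outlink's constructor differs between Nutch versions (some take a
    // Configuration argument too); adjust to whatever your tree has.
    all[existing.length + i] = new Outlink((String) textUrls.get(i), "");
  }
  ParseData d = parse.getData();
  return new ParseImpl(parse.getText(),
      new ParseData(d.getStatus(), d.getTitle(), all,
                    d.getContentMeta(), d.getParseMeta()));
}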
thanks!

2008/11/12 Cool The Breezer <[EMAIL PROTECTED]>

> You might want to add it for mime type text/html.
>
> To help you further, I have a class that implements HtmlParseFilter:
>
> public class MyImageParserParser implements HtmlParseFilter {
>   // I parse the content, extract the img urls and add them to the
>   // parse metadata.
>   public Parse filter(Content content, Parse parse,
>                       HTMLMetaTags metaTags, DocumentFragment doc) {
>     Metadata metaData = parse.getData().getParseMeta();
>     metaData.add(...); // add the extracted img urls here
>     return parse;
>   }
> }
>
> There is one more filter class, where I extract all values from the meta
> data and add them to the document:
>
> public class MyImageFilter implements IndexingFilter {
>   // I update the Document here by adding the img values.
>   public Document filter(Document doc, Parse parse, Text url,
>                          CrawlDatum datum, Inlinks inlinks)
>       throws IndexingException {
>     doc.add(new Field(IFields.IMG_DESCRIPTION, imgdescriptions,
>         Field.Store.YES, Field.Index.TOKENIZED));
>     return doc;
>   }
> }
>
> That's it; all fields will automatically get indexed.
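>
> For Nutch to pick the two classes up, they also have to be registered as
> extensions in the plugin's plugin.xml, and the plugin has to be listed in
> the plugin.includes property in nutch-site.xml. A rough sketch (the ids,
> names and class paths here are made up, adjust them to your own package):
>
> <plugin id="myimagefilter" name="My Image Filter"
>         version="0.0.1" provider-name="example.org">
>    <runtime>
>       <library name="myimagefilter.jar">
>          <export name="*"/>
>       </library>
>    </runtime>
>    <extension id="org.example.myimage.parsefilter"
>               name="My Image Parse Filter"
>               point="org.apache.nutch.parse.HtmlParseFilter">
>       <implementation id="MyImageParserParser"
>                       class="org.example.myimage.MyImageParserParser"/>
>    </extension>
>    <extension id="org.example.myimage.indexfilter"
>               name="My Image Indexing Filter"
>               point="org.apache.nutch.indexer.IndexingFilter">
>       <implementation id="MyImageFilter"
>                       class="org.example.myimage.MyImageFilter"/>
>    </extension>
> </plugin>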
>
> --- On Wed, 11/12/08, Alexander Aristov <[EMAIL PROTECTED]>
> wrote:
>
> > From: Alexander Aristov <[EMAIL PROTECTED]>
> > Subject: Re: how to crawl all the urls in the page
> > To: [email protected]
> > Date: Wednesday, November 12, 2008, 2:58 AM
> > I think this page will help you a lot:
> >
> > http://wiki.apache.org/nutch/WritingPluginExample-0.9
> >
> > Alex
> > 2008/11/12 kevin pang <[EMAIL PROTECTED]>
> >
> > > Cool,
> > >
> > > Thanks for your reply.
> > > I want to know which extension point this extension extends.
> > > thanks in advance.
> > >
> > > 2008/11/10 Cool The Breezer <[EMAIL PROTECTED]>
> > >
> > > > Create a new Nutch extension to add a new field to the Document which
> > > > contains the text of all links available in a page. Take a look at the
> > > > NekoHTML or HTMLParser documentation to see how to get all the links of
> > > > a page, and extract the text for each link. Then add a new field to the
> > > > Nutch document.
> > > >
> > > > I had the same kind of requirement: to get all image URLs from a page
> > > > and add them as a new field in the Nutch document. I used htmlparser to
> > > > extract all images, converted the URLs to comma-separated text, and
> > > > added them as a new field in the index.
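> > > >
> > > > Roughly like this untested sketch, using the htmlparser library (the
> > > > class and method names here are made up, this is not my actual code):
> > > >
> > > > import org.htmlparser.Parser;
> > > > import org.htmlparser.filters.NodeClassFilter;
> > > > import org.htmlparser.tags.ImageTag;
> > > > import org.htmlparser.util.NodeList;
> > > > import org.htmlparser.util.ParserException;
> > > >
> > > > public class ImageUrlExtractor {
> > > >   // Collect every <img> url in the raw HTML and join them into one
> > > >   // comma-separated string, ready to be stored in the parse metadata.
> > > >   public static String extractImageUrls(String html) throws ParserException {
> > > >     Parser parser = Parser.createParser(html, "UTF-8");
> > > >     NodeList images =
> > > >         parser.extractAllNodesThatMatch(new NodeClassFilter(ImageTag.class));
> > > >     StringBuilder urls = new StringBuilder();
> > > >     for (int i = 0; i < images.size(); i++) {
> > > >       ImageTag img = (ImageTag) images.elementAt(i);
> > > >       if (urls.length() > 0) urls.append(",");
> > > >       urls.append(img.getImageURL());
> > > >     }
> > > >     return urls.toString();
> > > >   }
> > > > }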
> > > >
> > > > - RB
> > > >
> > > > --- On Sun, 11/9/08, kevin pang <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > From: kevin pang <[EMAIL PROTECTED]>
> > > > > Subject: how to crawl all the urls in the page
> > > > > To: [email protected]
> > > > > Date: Sunday, November 9, 2008, 9:28 PM
> > > > > I want to crawl all the urls in the page, including those that
> > > > > appear as plain text, not just as hyperlinks. How do I add this
> > > > > rule to the Nutch fetcher? Can anyone help? Much appreciated.
> > > > >
> > > > > Regards,
> > > >
> >
> > --
> > Best Regards
> > Alexander Aristov