Re: how to crawl all the urls in the page

Cool The Breezer Wed, 12 Nov 2008 01:20:58 -0800

You might want to add it for the MIME type text/html.

To help you further,
I have a class that implements HtmlParseFilter:
public class MyImageParserParser implements HtmlParseFilter{
// Parse the content, extract the img URLs, and add them to the parse metadata


Metadata metaData = parse.getData().getParseMeta();
metaData.add........
}
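
As a rough illustration of the extraction step only, here is a minimal sketch that pulls img src values out of an HTML string and joins them as comma-separated text. It uses a plain JDK regex as a stand-in for htmlparser, so it is not the code from my plugin; the class and method names (ImgUrlExtractor, extractImgUrls, toCommaSeparated) are made up for this example, and a real HTML parser will cope with malformed markup far better.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or htmlparser.
class ImgUrlExtractor {
    // Matches the src attribute of an <img> tag, quoted with " or '.
    // Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the src values of all <img> tags, in document order.
    static List<String> extractImgUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    // Joins the URLs into one comma-separated value, suitable for
    // storing as a single metadata entry.
    static String toCommaSeparated(List<String> urls) {
        return String.join(",", urls);
    }
}
```

The joined string is what would go into the parse metadata, so the indexing filter can read it back later as a single value.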

There is one more filter class, where I extract all the values from the
metadata and add them to the document:
public class MyImageFilter implements IndexingFilter{
// Update the Document here by adding the img values

doc.add(new Field(IFields.IMG_DESCRIPTION, imgdescriptions, Field.Store.YES, 
Field.Index.TOKENIZED));

}

That's it; all fields will automatically get indexed.
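
For the original question (URLs that appear as plain text rather than as hyperlinks), the extraction step inside such a parse filter could look roughly like the sketch below. It is only an assumption about one way to do it: the class name TextUrlExtractor and the regex are made up for this example, it only recognises http and https URLs, and how the result is handed back to Nutch is not shown.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class TextUrlExtractor {
    // An http(s) URL runs until whitespace, an angle bracket, or a quote.
    private static final Pattern URL = Pattern.compile(
            "https?://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    // Returns the distinct URLs found in the text, in order of first appearance.
    static Set<String> extractUrls(String text) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            // Drop sentence punctuation that trails the URL in running text.
            String url = m.group().replaceAll("[.,;:!?)]+$", "");
            urls.add(url);
        }
        return urls;
    }
}
```

Because it scans the raw page content, it picks up URLs in href attributes as well as those written out as text, and the set removes the duplicates.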

--- On Wed, 11/12/08, Alexander Aristov <[EMAIL PROTECTED]> wrote:

> From: Alexander Aristov <[EMAIL PROTECTED]>
> Subject: Re: how to crawl all the urls in the page
> To: [email protected]
> Date: Wednesday, November 12, 2008, 2:58 AM
> I think this page will help you a lot:
> 
> http://wiki.apache.org/nutch/WritingPluginExample-0.9
> 
> 
> Alex
> 2008/11/12 kevin pang <[EMAIL PROTECTED]>
> 
> > Cool,
> >
> > Thanks for your reply.
> > Which extension point does this extension extend?
> > Thanks in advance.
> >
> > 2008/11/10 Cool The Breezer <[EMAIL PROTECTED]>
> >
> > > Create a new Nutch extension that adds a new field to the Document,
> > > containing the text of all links available in a page. Take a look at
> > > the NekoHTML or HTMLParser documentation to get all the links of a page
> > > and extract the text of each link. Then add a new field to the Nutch
> > > document.
> > >
> > > I had the same kind of requirement: get all image URLs from a page and
> > > add them as a new field in the Nutch document. I used htmlparser to
> > > extract all images, converted the URLs to comma-separated text, and
> > > added them as a new field in the index.
> > >
> > > - RB
> > >
> > >
> > > --- On Sun, 11/9/08, kevin pang <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: kevin pang <[EMAIL PROTECTED]>
> > > > Subject: how to crawl all the urls in the page
> > > > To: [email protected]
> > > > Date: Sunday, November 9, 2008, 9:28 PM
> > > > I want to crawl all the URLs in the page, including those displayed
> > > > as plain text, not just as hyperlinks. How do I add this rule to the
> > > > Nutch fetcher? Can anyone help? Much appreciated.
> > > >
> > > > Regards,
> > >
> > >
> > >
> > >
> >
> 
> 
> 
> -- 
> Best Regards
> Alexander Aristov
