[Nutch-general] Re: How to get page content given URL only?

Piotr Kosiorowski Fri, 09 Dec 2005 03:47:10 -0800

Hi,
You can iterate over the whole segment, comparing each record's URL with the
one you are looking for. Performance will not be great, but it is possible.
Regards
Piotr
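
A minimal sketch of the linear scan suggested above, for illustration only. It
does not use the real Nutch segment API: PageRecord, findContent and the
example URLs are hypothetical stand-ins, and an in-memory list plays the role
of the segment. Each lookup is O(n) in the number of records, which is why
performance would not be great on a large segment.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

class SegmentScan {

    /** Hypothetical stand-in for one fetched page stored in a segment. */
    static final class PageRecord {
        final String url;
        final String content;

        PageRecord(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    /**
     * Walks the records in order and returns the content of the first one
     * whose URL equals targetUrl, or an empty Optional if none matches.
     */
    static Optional<String> findContent(Iterator<PageRecord> records, String targetUrl) {
        while (records.hasNext()) {
            PageRecord record = records.next();
            if (record.url.equals(targetUrl)) {
                return Optional.of(record.content);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed sample data; a real segment would be read from disk.
        List<PageRecord> segment = Arrays.asList(
                new PageRecord("http://example.org/a", "page A"),
                new PageRecord("http://example.org/b", "page B"));

        System.out.println(
                findContent(segment.iterator(), "http://example.org/b").orElse("(not found)"));
    }
}
```

If many lookups are needed, building a URL-to-record-number map in a single
pass and reusing it would avoid rescanning the segment each time.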


On 12/9/05, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
>
> I had the same question as Nguyen. On the cache page, the lookup passes the
> docNo to the Segment.getContent(int docNo) method, and that docNo seems to
> originate from the index. So the question is whether this lookup can be done
> when the pages were not indexed and one wants to use the URL instead of the
> docNo. I looked at this for quite some time, and I think the answer is 'no',
> but maybe I missed something.
>
> Rgrds, Thomas
>
> On 12/9/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> >
> > Take a look at the cache page; it returns the content from the segment.
> >
> > On 09.12.2005 at 09:24, Nguyen Ngoc Giang wrote:
> >
> > >   Hi everyone,
> > >
> > >   I'm writing a small program which uses Nutch as a crawler only,
> > > with no search functionality. The program should be able to return the
> > > page content given a URL as input. How can we get the page content given
> > > only the URL? The webdb only provides a mechanism to get the metadata of
> > > a page given its URL, while segments can read content but require a
> > > record number.
> > >
> > >   Any help is greatly appreciated.
> > >
> > >   Best regards,
> > >   Giang
> >
> >
>
>
