[Nutch-general] Re: How to get page content given URL only?

Andy Liu Fri, 09 Dec 2005 08:20:03 -0800

Along the lines of what Piotr was suggesting, you can iterate through all
the segments and build a url -> recNo hash table, or you can store the mapping
in a relational database for quick lookups.  The URL is available in a
segment's FetchListEntry or Content objects, I believe.
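
To make the hash table idea concrete, here is a minimal sketch in plain Java. It does not call any Nutch API: UrlRecNoIndex is a hypothetical helper, and the list of URLs passed to build() stands in for whatever you read out of each record's FetchListEntry or Content object while walking one segment in record order.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Nutch): maps each URL to its record number in one segment. */
final class UrlRecNoIndex {
    private final Map<String, Long> recNoByUrl = new HashMap<>();

    /**
     * Builds the table from the URLs of one segment, listed in record order.
     * Assumption: the caller has already pulled these URLs out of the segment;
     * a plain list stands in for the segment reader here.
     */
    static UrlRecNoIndex build(List<String> urlsInRecordOrder) {
        UrlRecNoIndex index = new UrlRecNoIndex();
        long recNo = 0;
        for (String url : urlsInRecordOrder) {
            // Keep the first record seen for a URL; later duplicates are ignored.
            index.recNoByUrl.putIfAbsent(url, recNo);
            recNo++;
        }
        return index;
    }

    /** Returns the record number for the URL, or -1 if the URL is not in the segment. */
    long lookup(String url) {
        return recNoByUrl.getOrDefault(url, -1L);
    }
}
```

The one-time scan costs the same as Piotr's linear search, but every later lookup is a constant-time map access. With several segments, the value would also have to carry the segment name, not just the recNo.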


On 12/9/05, Jack Tang <[EMAIL PROTECTED]> wrote:
>
> Hi Nguyen
>
> I am going to face this problem too. Here are my thoughts. One field,
> say "uid", will be added to the index, and its value will be
> generated from the URL. Say the URL is http://www.a.com/x/y/z.html
>
> uid = md5_hash("http://www.a.com").append(md5_hash("/x/y/z.html"));
>
> Is that OK? When I query by this uid, only one page will be returned.
>
> Regards
> /Jack
> On 12/9/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
> >   Hi,
> >
> >   Thanks Stefan and Piotr for your suggestions. My doubt is the same as
> > Thomas's, since in the segments we store only the RecNo, which can be
> > retrieved only via searching, which in turn requires indexing.
> >
> >   Can we add the URL of the page during fetching, so that the segment also
> > stores the URL? But I think it's no better than searching, since eventually
> > we still need to search for the URL field in the segment.
> >
> >   Can anyone help? Thanks a lot.
> >
> >   Regards,
> >     Giang
> >
> >
> > On 12/9/05, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > > You can iterate over the whole segment, comparing the current URL with
> > > the one you are looking for. The performance would not be great, but it
> > > is possible.
> > > Regards
> > > Piotr
> > >
> > > On 12/9/05, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
> > > >
> > > > I had the same question as Nguyen. In the cache page the lookup uses
> > > > the docNo to call the Segment.getContent(int docNo) method, which
> > > > originates from the index? So the question is whether this lookup can
> > > > be done when one did not index the pages and wants to use the URL
> > > > instead of the docNo. I was looking at this for quite some time, and I
> > > > think the answer is 'no', but maybe I missed something.
> > > >
> > > > Rgrds, Thomas
> > > >
> > > > On 12/9/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Take a look at the cache page; it returns the content from the
> > > > > > segment.
> > > > >
> > > > > > On 09.12.2005 at 09:24, Nguyen Ngoc Giang wrote:
> > > > >
> > > > > >   Hi everyone,
> > > > > >
> > > > > > >   I'm writing a small program which uses Nutch as a crawler only,
> > > > > > > with no search functionality. The program should be able to
> > > > > > > return page content given a URL as input. I would like to ask
> > > > > > > how we can get the page content given only the URL, since the
> > > > > > > webdb only provides a mechanism to get metadata of a page given
> > > > > > > its URL, while segments can read content but require a record
> > > > > > > number.
> > > > > >
> > > > > >   Any help is greatly appreciated.
> > > > > >
> > > > > >   Best regards,
> > > > > >   Giang
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
