[Nutch-general] Re: How to get page content given URL only?

Jack Tang Fri, 09 Dec 2005 07:48:06 -0800

Hi Nguyen

I am going to face this problem too. Here are my thoughts. One field
will be added to the index, say "uid", and the value of uid will be
generated from the URL. Say the URL is http://www.a.com/x/y/z.html


uid = md5_hash("http://www.a.com").append(md5_hash("/x/y/z.html"));

Is that OK? When I query by this uid, only one page should be returned.
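
To make the idea concrete, here is a minimal sketch of the uid computation
in plain Java. It is only an illustration and does not use any Nutch API:
the class name UrlUid and the methods md5Hex and uid are made up for this
example, and the choice to hex-encode the digests and to include the query
string in the path part is an assumption, not something decided above.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: builds a "uid" from a URL by
// concatenating the MD5 of the site part and the MD5 of the path part.
class UrlUid {

    // Hex-encoded MD5 digest of the UTF-8 bytes of a string.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // uid = md5(scheme://authority) + md5(path[?query])
    static String uid(String url) {
        URI u = URI.create(url);
        String site = u.getScheme() + "://" + u.getAuthority();
        String path = u.getRawPath() == null ? "" : u.getRawPath();
        if (u.getRawQuery() != null) {
            // Assumption: the query string is treated as part of the page identity.
            path += "?" + u.getRawQuery();
        }
        return md5Hex(site) + md5Hex(path);
    }

    public static void main(String[] args) {
        // Prints a 64-character hex string; pages on the same site share
        // the first 32 characters.
        System.out.println(uid("http://www.a.com/x/y/z.html"));
    }
}
```

For the query to return exactly one page, the uid would have to be indexed
as a single untokenized term, and the URL would need to be normalized the
same way at index time and at query time (for example http://www.a.com/x
and http://www.a.com/x/ give different uids here).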

Regards
/Jack
On 12/9/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
>   Hi,
>
>   Thanks Stefan and Piotr for your suggestions. My doubt is the same as
> Thomas's, since in the segments we store only the RecNo, which can be
> retrieved only via searching, which in turn requires indexing.
>
>   Can we add the URL of the page during fetching, so that the segment also
> stores the URL? But I think it's no better than searching, since eventually
> we still need to search for the URL field in the segment.
>
>   Can anyone help? Thanks a lot.
>
>   Regards,
>     Giang
>
>
> On 12/9/05, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> > You can iterate over the whole segment, comparing the current URL with the
> > one you are looking for. The performance would not be great, but it is
> > possible.
> > Regards
> > Piotr
> >
> > On 12/9/05, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
> > >
> > > > I had the same question as Nguyen. In the cache page the lookup uses the
> > > > docNo to call the Segment.getContent(int docNo) method, which originates
> > > > from the Index? So the question is if this lookup can be done when one did
> > > > not index the pages, and wants to use the URL instead of the docNo. I was
> > > > looking at this for quite some time, and I think the answer is 'no', but
> > > > maybe I missed something.
> > >
> > > Rgrds, Thomas
> > >
> > > On 12/9/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Take a look at the cache page; it returns the content from the segment.
> > > >
> > > > > On 09.12.2005 at 09:24, Nguyen Ngoc Giang wrote:
> > > >
> > > > >   Hi everyone,
> > > > >
> > > > > >   I'm writing a small program which uses Nutch as a crawler only,
> > > > > > with no search functionality. The program should be able to return
> > > > > > page content given a URL as input. I would like to ask how we can get
> > > > > > the page content given only the URL, since webdb only provides a
> > > > > > mechanism to get the metadata of a page given its URL, while segments
> > > > > > can read content but require a record number.
> > > > >
> > > > >   Any help is greatly appreciated.
> > > > >
> > > > >   Best regards,
> > > > >   Giang
> > > >
> > > >
> > >
> > >
> >
> >
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
