Along the lines of what Piotr suggested, you can iterate through all the segments once and build a url -> recNo hash table, or store that mapping in a relational database for quick lookups. I believe the URL is available in a segment's FetchListEntry or Content objects.
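
To make the hash table idea concrete, here is a rough sketch in plain Java. None of it is Nutch API: UrlIndex, Location and addSegment are names I made up, and the list of URLs stands in for whatever you read from each record's FetchListEntry (or Content) while walking a segment. I am also assuming that a record's position in that walk is its recNo, and I keep the segment name next to the recNo because a record number only means something inside one segment.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical url -> (segment, recNo) lookup table; not part of Nutch. */
class UrlIndex {

    /** Where a page lives: which segment, and which record inside it. */
    static final class Location {
        final String segment;
        final int recNo;

        Location(String segment, int recNo) {
            this.segment = segment;
            this.recNo = recNo;
        }
    }

    private final Map<String, Location> byUrl = new HashMap<>();

    /**
     * Registers one segment. urlsInRecordOrder.get(i) is assumed to be the
     * URL of record number i, collected while iterating over the segment.
     */
    void addSegment(String segment, List<String> urlsInRecordOrder) {
        for (int recNo = 0; recNo < urlsInRecordOrder.size(); recNo++) {
            // If a URL was fetched in several segments, the one added last wins.
            byUrl.put(urlsInRecordOrder.get(recNo), new Location(segment, recNo));
        }
    }

    /** Returns the location of the URL, or null if it is in no segment. */
    Location lookup(String url) {
        return byUrl.get(url);
    }
}
```

Once the table is built, lookup(url) gives you the segment and recNo to pass to whatever reads the content. The one-time pass over the segments is the slow part; each lookup afterwards is a single hash probe. If the table gets too big for memory, the same two columns plus the URL would go into a database table keyed on the URL.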
On 12/9/05, Jack Tang <[EMAIL PROTECTED]> wrote:
>
> Hi Nguyen
>
> I am going to face this problem too. Here are my thoughts. One field
> will be added to the index, say "uid", and the value of uid will be
> generated from the URL. Say the url is http://www.a.com/x/y/z.html
>
> uid = md5_hash("http://www.a.com").append(md5_hash("/x/y/z.html"));
>
> Is that ok? When I query by this uid, only one page will be returned.
>
> Regards
> /Jack
>
> On 12/9/05, Nguyen Ngoc Giang <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Thanks Stefan and Piotr for your suggestions. My doubt is the same as
> > Thomas's, since in the segments we store only the RecNo, which can be
> > retrieved only via searching, which in turn requires indexing.
> >
> > Can we add the URL of the page during fetching, so that the segment
> > also stores the URL? But I think it's no better than searching, since
> > eventually we still need to search for the URL field in the segment.
> >
> > Can anyone help? Thanks a lot.
> >
> > Regards,
> > Giang
> >
> > On 12/9/05, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi,
> > > You can iterate over the whole segment, comparing the current url
> > > with the one you are looking for. The performance would not be
> > > great, but it is possible.
> > > Regards
> > > Piotr
> > >
> > > On 12/9/05, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
> > > >
> > > > I had the same question as Nguyen. In the cache page the lookup
> > > > uses the docNo to call the Segment.getContent(int docNo) method,
> > > > which originates from the Index? So the question is if this
> > > > lookup can be done when one did not index the pages, and wants to
> > > > use the URL instead of the docNo. I was looking at this for quite
> > > > some time, and I think the answer is 'no', but maybe I missed
> > > > something.
> > > >
> > > > Rgrds, Thomas
> > > >
> > > > On 12/9/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > Take a look at the cache page; it returns the content from the
> > > > > segment.
> > > > >
> > > > > On 09.12.2005 at 09:24, Nguyen Ngoc Giang wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I'm writing a small program which utilizes Nutch as a crawler
> > > > > > only, with no search functionality. The program should be
> > > > > > able to return page content given a URL as input. I would
> > > > > > like to ask how we can get the page content given only the
> > > > > > URL, since webdb only provides a mechanism to get the
> > > > > > metadata of a page given its URL, while segments can read
> > > > > > content but require a record number.
> > > > > >
> > > > > > Any help is greatly appreciated.
> > > > > >
> > > > > > Best regards,
> > > > > > Giang
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
