Jack,

I am not so much interested in the content of the Page objects (BTW: the content is not stored in the WebDB but in the Segments, I believe). What I am after is the relationship between the Page and Link objects, which is organized based on the MD5Hash of the Pages.

Rgrds, Thomas

On 1/19/06, Jack Tang <[EMAIL PROTECTED]> wrote:
>
> Hi Thomas
>
> I suppose the only unique key of contents in web db is page' url. So
> why not retrieve the content by url directly?
>
> /Jack
>
>
> On 1/8/06, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
> > I am working with Nutch 0.7.1.
> >
> > As far as I understand the current implementation (please correct me if I
> > am wrong), the MD5Hash is calculated based on the Pages' content. Pages with
> > the same content but identified by different URLs, share the same MD5Hash.
> >
> > My requirement is to be able to uniquely identify all Pages in WebDB. Pages
> > with the same content, but identified by different URL's, should become a
> > unique MD5Hash. My question is if this is feasible at all and if yes, how
> > this can be accomplished.
> >
> > Rgrds, Thomas Delnoij
> >
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
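
To make the question in the quoted 1/8/06 message concrete, here is a minimal Java sketch. It does not use any Nutch classes: PageHashSketch, contentHash and urlAndContentHash are hypothetical names, the URLs and page body are made-up sample values, and the only thing taken from the thread is the assumption that the hash is an MD5 over the page content. It shows that a content-only hash is shared by two URLs serving the same content, and that mixing the URL into the hashed bytes is one possible way to get a distinct hash per Page. Whether WebDB can be made to work with such a hash is a separate question that this sketch does not answer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PageHashSketch {

    // Hex-encode the MD5 digest of the given bytes (plain JDK, no Nutch classes).
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present in every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Content-only hash: identical content gives an identical hash,
    // regardless of the URL it was fetched from.
    static String contentHash(String content) {
        return md5Hex(content.getBytes(StandardCharsets.UTF_8));
    }

    // Assumption for illustration: prefix the URL (plus a newline separator)
    // so that the same content under different URLs hashes differently.
    static String urlAndContentHash(String url, String content) {
        return md5Hex((url + "\n" + content).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String content = "<html><body>same page body</body></html>";
        String urlA = "http://example.com/a";
        String urlB = "http://example.com/b";

        // Same content, different URLs: the content-only hashes are equal.
        System.out.println(contentHash(content).equals(contentHash(content))); // prints true

        // With the URL mixed in, the two Pages get different hashes.
        System.out.println(urlAndContentHash(urlA, content)
                .equals(urlAndContentHash(urlB, content))); // prints false
    }
}
```

The trade-off of the second variant is that the hash no longer identifies duplicate content, so anything that relies on equal hashes to detect duplicates would need the content-only hash as well.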
