Jack,

I am not so much interested in the content of the Page objects (BTW: the
content is not stored in the WebDB but in the Segments, I believe). What I
am after is the relationship between the Page and Link objects. That
relationship is organized based on the MD5Hash of the Pages.
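
To make the distinction concrete, here is a minimal sketch of the two
hashing behaviours discussed in this thread. It assumes plain
java.security.MessageDigest and modern Java rather than Nutch's own
MD5Hash class, and the names PageKeys, contentHash and urlAwareHash are
made up for illustration; they are not part of the Nutch API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of the Nutch API.
final class PageKeys {

    // Returns the hex-encoded MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Content-only hash: identical content yields an identical hash,
    // regardless of the URL it was fetched from.
    static String contentHash(String content) {
        return md5Hex(content.getBytes(StandardCharsets.UTF_8));
    }

    // URL-aware hash (an assumption, not Nutch behaviour): mixing the URL
    // into the digest input gives identical content under different URLs
    // different hashes. A newline separates the two parts because a URL
    // cannot contain a raw newline.
    static String urlAwareHash(String url, String content) {
        return md5Hex((url + "\n" + content).getBytes(StandardCharsets.UTF_8));
    }
}
```

With this sketch, two Pages with the same content but different URLs get
the same contentHash but different urlAwareHash values, which is the
property I am asking about.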

Rgrds, Thomas

On 1/19/06, Jack Tang <[EMAIL PROTECTED]> wrote:
>
> Hi Thomas
>
> I suppose the only unique key for content in the WebDB is the page's URL.
> So why not retrieve the content by URL directly?
>
> /Jack
>
>
> On 1/8/06, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
> > I am working with Nutch 0.7.1.
> >
> > As far as I understand the current implementation (please correct me if I
> > am wrong), the MD5Hash is calculated based on the Pages' content. Pages with
> > the same content but identified by different URLs share the same MD5Hash.
> >
> > My requirement is to be able to uniquely identify all Pages in WebDB. Pages
> > with the same content, but identified by different URLs, should get a
> > unique MD5Hash. My question is whether this is feasible at all and, if so,
> > how this can be accomplished.
> >
> > Rgrds, Thomas Delnoij
> >
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
