[Nutch-general] Re: MD5Hash

Jack Tang Wed, 18 Jan 2006 22:25:02 -0800

Hi Thomas

I suppose the only unique key for content in the WebDB is the page's URL. So
why not retrieve the content by URL directly?
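
To illustrate the difference, here is a rough sketch that uses plain
java.security.MessageDigest rather than Nutch's own MD5Hash class. The class
and method names (UrlKeySketch, contentHash, urlHash) and the example URLs are
made up for this example and are not part of Nutch 0.7.1. A hash over the
content collides for mirrored pages, while a hash over the URL does not:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not a Nutch class: contrasts content-based and URL-based MD5 keys.
final class UrlKeySketch {

    // Hex-encoded MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Key derived from the page content: identical content gives an identical key.
    static String contentHash(String content) {
        return md5Hex(content.getBytes(StandardCharsets.UTF_8));
    }

    // Key derived from the URL: distinct URLs give distinct keys (barring MD5 collisions).
    static String urlHash(String url) {
        return md5Hex(url.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String body = "<html>same page</html>";
        String urlA = "http://example.com/a.html";   // assumed example URLs
        String urlB = "http://example.org/mirror/a.html";

        // Same content, so the content hashes are equal.
        System.out.println(contentHash(body).equals(contentHash(body)));
        // Different URLs, so the URL hashes differ.
        System.out.println(urlHash(urlA).equals(urlHash(urlB)));
    }
}
```

If a fixed-size key is not needed, the URL string itself can serve as the
unique identifier and no hashing is required.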


/Jack


On 1/8/06, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
> I am working with Nutch 0.7.1.
>
> As far as I understand the current implementation (please correct me if I
> am wrong), the MD5Hash is calculated from the page's content. Pages with
> the same content but identified by different URLs share the same MD5Hash.
>
> My requirement is to be able to uniquely identify all Pages in WebDB. Pages
> with the same content, but identified by different URLs, should each get a
> unique MD5Hash. My question is whether this is feasible at all and, if so,
> how it can be accomplished.
>
> Rgrds, Thomas Delnoij
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


