Hallvard,

You could use id: checksum/secure read url and treat
this id as a unique hash identifier for the page.  Use this id
to index your database.  If two URLs have exactly the
same content you will get the same checksum, so you can
add the new URL reference to the db without needing to
store the page content again (assuming you are storing the
pages), since it is already there.  If you have never seen
the id before, it means you have new content, and you
store the (id, url, content) in the db.
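
A minimal sketch of that bookkeeping, written in Python rather than REBOL
purely for illustration. The names page_id and PageStore are made up for
this example, and two in-memory dictionaries stand in for the database
tables; the page bodies are placeholder bytes, not real downloads.

```python
import hashlib


def page_id(content: bytes) -> str:
    """Return the SHA-1 hex digest of the page body, used as the content id."""
    return hashlib.sha1(content).hexdigest()


class PageStore:
    """Hypothetical in-memory stand-in for the (id, url, content) tables."""

    def __init__(self):
        self.pages = {}  # id  -> content (each distinct body stored once)
        self.urls = {}   # url -> id      (many URLs may share one id)

    def add(self, url: str, content: bytes) -> bool:
        """Record the URL; return True only if the content was new."""
        pid = page_id(content)
        is_new = pid not in self.pages
        if is_new:
            self.pages[pid] = content
        self.urls[url] = pid
        return is_new


store = PageStore()
body = b"<html>same page</html>"  # placeholder body for both URLs
print(store.add("http://www.softinnov.com/bdd.html", body))  # new content
print(store.add("http://softinnov.com/bdd.html", body))      # duplicate
print(len(store.pages), len(store.urls))
```

The second call only adds a URL reference, so the store ends up with one
page and two URLs pointing at it.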

For spotting duplicates, this way of indexing is better than using the
url as the unique identifier.
I believe this is used by some cache servers like squid.

The chances of two different pages generating the same
hash id via the checksum algorithm are really low; if I am correct,
REBOL's checksum/secure uses SHA1 for this.
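
To put a rough number on "really low", here is a small sketch (again in
Python, for illustration only) of the usual birthday-style upper bound.
It assumes the hash behaves like a uniformly random 160-bit value and
that nobody is crafting collisions on purpose; collision_bound is a
hypothetical helper name.

```python
def collision_bound(n_pages: int, hash_bits: int = 160) -> float:
    """Upper bound on the chance that any two of n_pages share a hash.

    Uses the union bound: (number of page pairs) / 2**hash_bits.
    """
    pairs = n_pages * (n_pages - 1) // 2
    return pairs / 2 ** hash_bits


# For a database of about 10000 entries, as in the original question:
print(collision_bound(10_000))  # about 3.4e-41
```

At that scale an accidental collision is far less likely than a disk or
network error corrupting the page.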

Hope this helps. Cheers,  Jaime

-- The best way to predict the future is to invent it -- Steve Jobs

On Friday, October 24, 2003, at 02:31  AM, Hallvard Ystad wrote:

>
> Thanks both.
>
> But theoretically, these two URLs may very well not
> represent the same document:
> http://www.uio.no/
> http://uio.no/
> but still reside on the same server (same dns entry).
>
> So ...  Is it possible to _know_ whether or not these two
> documents are the same without downloading their documents
> and comparing them? (I really don't think so myself, but
> someone might know something I don't.)
>
> I suddenly realize this has got very little to do with
> Rebol. Sorry.
>
> Hallvard
>
> Dixit Tom Conlin <[EMAIL PROTECTED]> (Wed, 22 Oct
> 2003 10:00:08 -0700 (PDT)):
>>
>> On Wed, 22 Oct 2003, Hallvard Ystad wrote:
>>
>>>
>>> Hi list
>>>
>>> My rebol stuff search engine now has more than 10000
>>> entries, and works pretty fast thanks to DocKimbel's
>>> mysql protocol.
>>>
>>> Here's a problem:
>>> Some websites work both with and without the www prefix
>>> (ex. www.rebol.com and just plain and simple rebol.com).
>>> Sometimes this gives double records in my DB (ex.
>>> http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql :
>>> you'll see that both http://www.softinnov.com/bdd.html and
>>> http://softinnov.com/bdd.html appear).
>>>
>>> Is there a way to detect such behaviour on a server? Or
>>> do I have to compare my incoming document to whatever
>>> documents I already have in the DB that _might_ be the
>>> same document?
>>>
>>> Thanks,
>>> Hallvard
>>>
>>> Prætera censeo Carthaginem esse delendam
>>> --
>>> To unsubscribe from this list, just send an email to
>>> [EMAIL PROTECTED] with unsubscribe as the subject.
>>>
>>
>> Hi Hallvard
>>
>> I ran into different reasons for finding more than one
>> url to a page (URLs expressed as relative links)
>> and wrote a QAD function that served my purpose at the
>> time.
>>
>> just added Anton's suggestion, maybe it will serve
>>
>>
>> do http://darkwing.uoregon.edu/~tomc/core/web/url-encode.r
>>
>> canotical-url: func[ url /local t p q][
>>    replace/all url "\" "/"
>>    t: parse url "/"
>>    while [p: find t ".."][remove remove back p]
>>    while [p: find t "."][remove p]
>>    p: find t ""
>>    while [p <> q: find/last t ""][remove q]
>>
>>    ;;; this is untested
>>    ;;; using Anton's suggestion
>>
>>    if not find t/3 "www."[
>>      if equal? read join dns:// t/3 read join dns://www. t/3
>>      [insert t/3  "www."]
>>    ]
>>
>>    for i 1 (length? t) - 1 1[append t/:i "/"]
>>    to-url url-encode/re rejoin t
>> ]
>>
>
> Prætera censeo Carthaginem esse delendam
>
>
