On Wed, 22 Oct 2003, Hallvard Ystad wrote:

>
> Hi list
>
> My rebol stuff search engine now has more than 10000
> entries, and works pretty fast thanks to DocKimbel's MySQL
> protocol.
>
> Here's a problem:
> Some websites work both with and without the www prefix
> (ex. www.rebol.com and just plain and simple rebol.com).
> Sometimes this gives double records in my DB (ex.
> http://www.oops-as.no/cgi-bin/rebsearch.r?q=mysql : you'll
> see that both http://www.softinnov.com/bdd.html and
> http://softinnov.com/bdd.html appear).
>
> Is there a way to detect such behaviour on a server? Or do
> I have to compare my incoming document to whatever
> documents I already have in the DB that _might_ be the
> same document?
>
> Thanks,
> Hallvard
>
> Prætera censeo Carthaginem esse delendam
> --
> To unsubscribe from this list, just send an email to
> [EMAIL PROTECTED] with unsubscribe as the subject.
>

Hi Hallvard

I ran into several different reasons for finding more than one URL for a page
(for example, URLs expressed as relative links)
and wrote a quick-and-dirty (QAD) function that served my purpose at the time.

I have just added Anton's suggestion; maybe it will serve.


do http://darkwing.uoregon.edu/~tomc/core/web/url-encode.r

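canotical-url: func [url /local t p q ip] [
    replace/all url "\" "/"             ; normalise backslashes (changes url in place)
    t: parse url "/"                    ; split into pieces: "http:" "" host path ...
    while [p: find t ".."] [remove remove back p]   ; drop each ".." and the piece before it
    while [p: find t "."] [remove p]    ; drop "." pieces
    p: find t ""                        ; the empty piece from the "//" after the scheme
    while [p <> q: find/last t ""] [remove q]       ; drop empty pieces left by any later "//"

    ;;; this is untested
    ;;; using Anton's suggestion: add "www." when both names resolve to the same address

    if not find t/3 "www." [
        ip: read join dns:// t/3
        ; only compare when the first lookup returned something
        if all [ip  equal? ip read join dns://www. t/3] [insert t/3 "www."]
    ]

    for i 1 (length? t) - 1 1 [append t/:i "/"]     ; put the slashes back
    to-url url-encode/re rejoin t       ; url-encode comes from the script loaded above
]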
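
A rough, untested sketch of how it might be called before a URL goes into
the DB, assuming the function above is loaded and the DNS lookups can reach
the network. The two URLs are the ones from your example; `seen` is a
made-up block that only stands in for the real DB lookup:

```rebol
seen: copy []   ; hypothetical stand-in for the URLs already in the DB

foreach u [http://softinnov.com/bdd.html http://www.softinnov.com/bdd.html] [
    c: canotical-url copy u     ; copy, since replace/all changes its argument in place
    either find seen c [
        print ["duplicate:" u]  ; same canonical form already stored
    ][
        append seen c           ; first time this canonical form is seen
    ]
]
```

If both host names resolve to the same address, the second URL should be
reported as a duplicate of the first. Two different sites can share one
address, so this is only a hint that the documents are the same, not proof.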