H! I thought I was ready with my own spider... But then there was a bug, or in other words a missing part in my code.
I forget that people do this in website html: <a href="http://www.nic.nl/monkey.html">is oke</a> <a href="../monkey.html">error</a> <a href="../../monkey.html">error</a> So now i'm trying to fix my spider but it fails and it fails. I tryed something like this. ------------------------ import urlparse import string def fixPath(urlpath,deep): path='' test = urlpath.split('/',deep) for this in test: if this<>'' and this.count('.')==0: path=path+'/'+this return path def fixUrl2(src,url): url = urlparse.urlparse('http://'+url) src = urlparse.urlparse('http://'+src) if url[2]: thepath = fixPath(url[2],url[2].count('/')-(src[2].count('/'))) if src[1] == '..': if url[1]<>'': theurl = url[1]+''+thepath+''+src[2].replace('../','') print theurl fixUrl2('../monkey2.html','www.nic.nl/test/info/monkey1.html') fixUrl2('../../monkey2.html','www.nic.nl/test/info/monkey1.html') fixUrl2('../monkey2.html','www.nic.nl/info/monkey1.html') ------------------------ info: fixUrl2('a new link found','in this page') I hope someone knows a professional working code for this, Thanks a lot, GC-Martijn -- http://mail.python.org/mailman/listinfo/python-list