> I have a problem with the Spider.py script when it comes to URLs with
> &name=val parameters.
>
> <a HREF=http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1&byte=1801>Genesis</a>
>
> gets scanned as
>
> http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1

Seems perfectly logical to me, since the '&' is interpreted by your shell:
an unquoted '&' ends the command, so everything after it never reaches the
program as part of the URL.

You need to double-quote URLs inside HREF attributes anyway. An unquoted
value containing characters such as '/', '?' and '&' is not valid HTML
syntax.

    <a href=http://www.foo.blort/>Foo</a>                  <!-- wrong -->
    <a href="http://www.foo.blort/" title="foo">Foo</a>    <!-- right -->

> I spent more time than I care to admit hacking around the PyPlucker python
> files, but I cannot see where it is going wrong in sgmllib and/or
> TextParser....

It's not a Python problem, it's your shell.

In the Perl world, we can auto-escape all of this with uri_escape($url), or
by just passing the URL in list mode to something like 'system()', which
doesn't use a shell at all.

d.
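
For the Python side, here is a minimal sketch of the same two ideas: pass the URL as a list element so no shell ever sees it, or quote it if a shell command line is unavoidable. The child process below is only a stand-in that echoes its first argument; it is not how Spider.py is actually invoked, and the variable names are illustrative assumptions.

```python
import shlex
import subprocess
import sys

url = "http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1&byte=1801"

# List form: subprocess runs the program directly, without a shell,
# so the '&' is delivered to the child as part of the argument.
# The child is a stand-in: the Python interpreter printing its first argument.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", url],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.strip()

# If the URL must appear on a POSIX shell command line,
# shlex.quote() wraps it so the shell treats it as a single word.
quoted = shlex.quote(url)

print(received)
print(quoted)
```

The list form is usually the safer choice, since it sidesteps shell quoting rules entirely; shlex.quote() only covers POSIX-style shells.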