Quoting the href="value" did the trick. So my problem was really malformed html on the target site.
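
For anyone hitting the same thing, here is a minimal sketch of how an attribute scanner can cut an unquoted value short at '&'. The pattern below is a hypothetical illustration, not the actual sgmllib or PyPlucker code: it assumes the parser only accepts a fixed set of "safe" characters in unquoted values and that '&' is not among them. The helper name href_of is made up for this example.

```python
import re

# Hypothetical attribute pattern (an assumption, not copied from sgmllib):
# a value is either quoted, or an unquoted run of "safe" characters.
# Note that '&' is deliberately absent from the unquoted character set.
NAME = r"[a-zA-Z_][-.a-zA-Z_0-9]*"
QUOTED = r"\"[^\"]*\"|'[^']*'"
UNQUOTED = r"[-a-zA-Z0-9./:+*%?!()_#=~]*"
ATTR = re.compile(r"(%s)\s*=\s*(%s|%s)" % (NAME, QUOTED, UNQUOTED))


def href_of(tag):
    """Return the href value found in a start tag, or None if there is none."""
    for name, value in ATTR.findall(tag):
        if name.lower() == "href":
            # Strip one pair of matching surrounding quotes, if present.
            if value[:1] in ('"', "'") and value[-1:] == value[:1]:
                value = value[1:-1]
            return value
    return None


URL = "http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1&byte=1801"

unquoted = href_of("<a HREF=%s>" % URL)   # value stops at the '&'
quoted = href_of('<a HREF="%s">' % URL)   # whole URL survives

print(unquoted)
print(quoted)
```

Under that assumption the unquoted form comes back truncated at type=DIV1, and the quoted form comes back whole, which matches what I saw.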

Fun fact: I have since observed that the URL works fine without the quotes with Python 1.5. Python 2.1 is apparently more temperamental.

And where is the shell involved in this? I'm afraid I don't see the connection.

Regards, and thank you,
-craig

>From: "David A. Desrosiers" <[EMAIL PROTECTED]>
>To: Plucker General List <[EMAIL PROTECTED]>
>Subject: Re: Problem plucking CGI URLs with & parameters
>Date: Wed, 14 Aug 2002 19:49:40 -0400 (EDT)
>
> > I have a problem with the Spider.py script when it comes to URLs with
> > &name=val parameters.
> >
> > <a HREF=http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1&byte=1801>Genesis</a>
> >
> > gets scanned as
> >
> > http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1
>
>Seems perfectly logical to me, since the '&' is interpreted by your
>shell. You need to double-quote URLs inside HREF tags anyway, and leaving
>them unquoted is actually not valid HTML syntax.
>
>    <a href=http://www.foo.blort/>Foo</a> <!-- wrong -->
>
>    <a href="http://www.foo.blort/" alt="foo">Foo</a> <!-- right -->
>
> > I spent more time than I care to admit hacking around the PyPlucker python
> > files, but I cannot see where it is going wrong in sgmllib and/or
> > TextParser....
>
>It's not a python problem, it's your shell.
>
>In the perl world, we can auto-escape all of this with
>uri_escape($url) or just passing the url in list-mode to something like
>'system()', which doesn't use a shell at all.
>
>d.

