Quoting the href="value" did the trick.  So my problem was really malformed
HTML on the target site.
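
For illustration, here is a minimal sketch of how an attribute scanner can
truncate an unquoted value at '&'.  The pattern below is a simplified,
hypothetical one written for this example; it is NOT the actual sgmllib
source, and href_value() is a made-up helper name.  It only assumes that
unquoted values are limited to a fixed character set that leaves out '&',
which would produce the truncation reported in the quoted message below.

```python
import re

# Hypothetical, simplified attribute pattern; NOT the actual sgmllib source.
# A value is either double-quoted, single-quoted, or an unquoted run of
# characters from a fixed set. '&' is deliberately absent from that set.
ATTR = re.compile(
    r'\s*([a-zA-Z_][-.a-zA-Z_0-9]*)\s*=\s*'
    r'("[^"]*"|\'[^\']*\'|[-a-zA-Z0-9./:+*%?!()_#=~]*)'
)


def href_value(tag_body):
    """Return the href value found inside a start tag, or None."""
    for name, value in ATTR.findall(tag_body):
        if name.lower() == "href":
            # Strip one pair of matching quotes, if present.
            if value[:1] in ('"', "'") and value[-1:] == value[:1]:
                value = value[1:-1]
            return value
    return None


URL = "http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1&byte=1801"

unquoted = href_value("HREF=" + URL)        # scanning stops at '&'
quoted = href_value('HREF="' + URL + '"')   # quotes delimit the whole value

print(unquoted)
print(quoted)
```

With a pattern like this, the quoted form comes back whole while the
unquoted form loses everything from the '&' onward, with no shell involved.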

Fun fact:  I have since observed that the URL works fine without the quotes
with Python 1.5.  Python 2.1 is apparently more temperamental.  And where is
the shell involved in this?  I'm afraid I don't see the connection.
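
For comparison, a shell would only get to interpret the '&' if the URL were
pasted unquoted onto a command line.  The sketch below is a rough Python
counterpart to the Perl list-mode system() idea mentioned in the quoted
message; the child command is just a stand-in that echoes its first
argument, and it assumes Python 3.7 or later for capture_output.

```python
import shlex
import subprocess
import sys

URL = "http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1&byte=1801"

# List form: no shell is started, so '&' is just another character in argv.
# The child is a stand-in that prints the first argument it receives.
child = [sys.executable, "-c", "import sys; print(sys.argv[1])", URL]
result = subprocess.run(child, capture_output=True, text=True, check=True)
received = result.stdout.strip()
print(received)

# If a shell command line really is needed, quote the URL first so the
# shell treats it as a single word.
safe_word = shlex.quote(URL)
print(safe_word)
```

None of this comes into play while a parser is reading an HTML file; it
only matters when a URL is handed to a command line.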

Regards, and thank you,
   -craig


>From: "David A. Desrosiers" <[EMAIL PROTECTED]>
>To: Plucker General List <[EMAIL PROTECTED]>
>Subject: Re: Problem plucking CGI URLs with & parameters
>Date: Wed, 14 Aug 2002 19:49:40 -0400 (EDT)
>
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>
> > I have a problem with the Spider.py script when it comes to URLs with
> > &name=val parameters.
>
> > <a HREF=http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1&byte=1801>Genesis</a>
> >
> > gets scanned as
> >
> > http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1
>
>       Seems perfectly logical to me, since the '&' is interpreted by your
>shell. You need to double-quote URLs inside HREF tags anyway, and leaving
>them unquoted is actually not valid HTML syntax.
>
>       <a href=http://www.foo.blort/>Foo</a>     <!-- wrong -->
>
>       <a href="http://www.foo.blort/" alt="foo">Foo</a>   <!-- right -->
>
> > I spent more time than I care to admit hacking around the PyPlucker python
> > files, but I cannot see where it is going wrong in sgmllib and/or
> > TextParser....
>
>       It's not a python problem, it's your shell.
>
>       In the perl world, we can auto-escape all of this with
>uri_escape($url) or just passing the url in list-mode to something like
>'system()', which doesn't use a shell at all.
>
>
>d.
>
>
>-----BEGIN PGP SIGNATURE-----
>Version: GnuPG v1.0.7 (GNU/Linux)
>
>iD8DBQE9WuyWkRQERnB1rkoRAiTvAJ4kO3WA8B92h/UQ634rrvi/Cua+SACgzW1Z
>rlRQs1K8s/WLScp5uQvtT0M=
>=ViVm
>-----END PGP SIGNATURE-----





