-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> I have a problem with the Spider.py script when it comes to URLs with
> &name=val parameters.

> <a
> HREF=http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1&byte=1801>Genesis</a>
>
> gets scanned as
>
> http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1

        Seems perfectly logical to me, since the '&' is interpreted by your
shell. You need to quote URLs inside HREF attributes anyway: an unquoted
value containing characters such as '/', '?', '=' or '&' is not valid
HTML 4 syntax.

        <a href=http://www.foo.blort/>Foo</a>     <!-- wrong -->

        <a href="http://www.foo.blort/" alt="foo">Foo</a>   <!-- right -->

> I spent more time than I care to admit hacking around the PyPlucker python
> files, but I cannot see where it is going wrong in sgmllib and/or
> TextParser....

        It's not a Python problem; it's your shell.

        In the Perl world, we can auto-escape all of this with
uri_escape($url), or by just passing the URL in list mode to something like
'system()', which doesn't use a shell at all.
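
        For the Python side, a rough sketch of the same two ideas might
look like the following. It assumes Python 3's standard library
(urllib.parse.quote and subprocess.run); none of this is taken from
Spider.py itself, and the child command is just a stand-in that echoes
its argument back.

```python
import subprocess
import sys
from urllib.parse import quote

url = "http://www.hti.umich.edu/cgi/r/rsv/rsv-idx?type=DIV1&byte=1801"

# Rough analogue of Perl's uri_escape(): percent-encode every character
# outside the unreserved set, so '&', '?', '=', ':' and '/' become %XX.
escaped = quote(url, safe="")

# Rough analogue of list-mode system(): the argument list is handed
# directly to the child process, so no shell ever sees the '&'.
# The child here is only this same Python interpreter printing argv[1].
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", url],
    capture_output=True,
    text=True,
    check=True,
)
echoed = result.stdout.strip()

print(escaped)
print(echoed)
```

        Either approach on its own keeps the '&' away from the shell;
which one fits depends on whether the receiving program expects an
escaped or a literal URL.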


d.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)

iD8DBQE9WuyWkRQERnB1rkoRAiTvAJ4kO3WA8B92h/UQ634rrvi/Cua+SACgzW1Z
rlRQs1K8s/WLScp5uQvtT0M=
=ViVm
-----END PGP SIGNATURE-----
