Hi,
I've used LWP in several apps in which the key bit of information I'm
after is the headers. I've therefore got used to the fact that if the
returned resource is HTML, one of the triggers for "OK, that's all the
headers and everything else must be content" is the presence of anything
in the <head> section of the document that LWP doesn't recognise.
Take this, for example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html
xmlns:creativeCommons='http://backend.userland.com/creativeCommonsRssModule'
xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">
<creativeCommons:license>http://creativecommons.org/licenses/by-nc-nd/3.0/</creativeCommons:license>
<head profile="http://gmpg.org/xfn/11">
...
Perfectly valid XHTML, but LWP doesn't recognise the
<creativeCommons:license> element and so stops parsing the headers.
The User Agent package I'm using is version 2.31.
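To make the behaviour easier to reproduce without a network round trip,
here is a minimal sketch. It assumes the header extraction is done by
HTML::HeadParser, which LWP::UserAgent uses when its parse_head option is
switched on. The sub name head_title and the two test documents are made
up for illustration; only the Title pseudo-header is checked.

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Return the Title pseudo-header that HTML::HeadParser extracts from $html,
# or undef if the parser gave up before reaching <title>.
sub head_title {
    my ($html) = @_;
    my $parser = HTML::HeadParser->new;
    # parse() returns false once the parser decides the head is over,
    # so only signal end of input if it is still expecting more.
    $parser->eof if $parser->parse($html);
    return $parser->header('Title');
}

# Hypothetical test document, modelled on the example above.
my $with_extra = <<'HTML';
<html xmlns="http://www.w3.org/1999/xhtml">
<creativeCommons:license>http://creativecommons.org/licenses/by-nc-nd/3.0/</creativeCommons:license>
<head><title>Demo</title></head>
<body><p>Hello</p></body></html>
HTML

# Same document with the extra element removed.
(my $without_extra = $with_extra)
    =~ s{<creativeCommons:license>.*?</creativeCommons:license>\n}{};

for my $case ([with => $with_extra], [without => $without_extra]) {
    my ($label, $html) = @$case;
    my $title = head_title($html);
    print "$label extra element: ",
        (defined $title ? "Title: $title" : "no Title found"), "\n";
}
```

If this is what is going on, then calling $ua->parse_head(0) on the
LWP::UserAgent object should stop LWP from deriving any headers from the
document at all, which at least makes the two versions behave the same.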
So, some questions:
1. Which modules need updating so that LWP can recognise this kind of
thing as valid <head> content?
2. Has anyone written such a module?
As a demonstration, [1] and [2] show the status line, headers_as_string
and content from two versions of the same document, the only difference
between the two being that in [2], the <creativeCommons:license> element
is commented out. You can get this output from any URI using the form at [3].
Thanks for any help.
Phil.
[1]
http://www.icra.org/cgi-bin/HTTP_Headers.cgi?url=http%3A%2F%2Fwww.icra.org%2Flabel%2FHTTP-Test%2Fspace.htm
[2]
http://www.icra.org/cgi-bin/HTTP_Headers.cgi?url=http%3A%2F%2Fwww.icra.org%2Flabel%2FHTTP-Test%2Fspace-mod.htm
[3] http://www.icra.org/label/HTTP-Test/
--
Phil Archer
Chief Technical Officer,
Family Online Safety Institute
w. http://www.fosi.org/people/philarcher/