Hi,

I've used LWP in several apps in which the key bit of information I'm after is the headers. I've therefore got used to the fact that if the returned resource is HTML, one of the triggers for "OK, that's all the headers and everything else must be content" is the presence of anything in the <head> section of the document that LWP doesn't recognise.

Take this, for example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:creativeCommons='http://backend.userland.com/creativeCommonsRssModule'
 xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US">

<creativeCommons:license>http://creativecommons.org/licenses/by-nc-nd/3.0/</creativeCommons:license>

<head profile="http://gmpg.org/xfn/11">
...

Perfectly valid XHTML - but LWP doesn't recognise the <creativeCommons:license> tag, and so it stops parsing the headers at that point.
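
The effect can be reproduced without a network round trip. This is only a minimal sketch: it assumes that the headers LWP takes from the document come from HTML::HeadParser (the module behind LWP::UserAgent's parse_head option). The helper name head_title and the two cut-down sample documents are made up for illustration.

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper: feed a document to a fresh HTML::HeadParser and
# return the Title header it extracted (undef if it extracted none).
sub head_title {
    my ($html) = @_;
    my $parser = HTML::HeadParser->new;
    $parser->parse($html);
    return scalar $parser->header('Title');
}

# Two cut-down variants of the same document (illustrative only).
my $rest   = "<head><title>Test</title></head>\n<body><p>Hi</p></body></html>\n";
my $plain  = "<html>\n" . $rest;
my $tagged = "<html>\n"
           . "<creativeCommons:license>http://creativecommons.org/licenses/by-nc-nd/3.0/</creativeCommons:license>\n"
           . $rest;

for my $case ([plain => $plain], [tagged => $tagged]) {
    my ($name, $html) = @$case;
    my $title = head_title($html);
    printf "%-6s Title header: %s\n", $name, defined $title ? $title : '(none)';
}
```

If the assumption holds, the "tagged" variant should report no Title header, because the parser gives up at the first element it does not expect in the head.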

I'm using version 2.31 of the User Agent package (LWP::UserAgent).

So, some questions:

1. Which modules need updating so that LWP can recognise this kind of thing as valid <head> content?

2. Has anyone written such a module?

As a demonstration, [1] and [2] show the status line, headers_as_string and content for two versions of the same document. The only difference between the two is that in [2] the <creativeCommons:license> element is commented out. You can get this output for any URI using the form at [3].
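
For anyone who wants the same three pieces of output locally, here is a rough sketch. It is not the code behind the form at [3]. The helper name response_report is hypothetical, and the hand-built response only stands in for a real fetch.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: lay out the status line, the headers and the
# content of an HTTP::Response as one block of text.
sub response_report {
    my ($res) = @_;
    return join "\n",
        'Status:  ' . $res->status_line,
        'Headers:',
        $res->headers_as_string,
        'Content:',
        $res->content;
}

# Offline stand-in for a fetched response (illustrative values only).
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html', 'Title' => 'Test' ],
    "<html>...</html>\n",
);
print response_report($res);

# With a live request, something along these lines should work:
#   use LWP::UserAgent;
#   print response_report(LWP::UserAgent->new->get($url));
```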

Thanks for any help

Phil.

[1] http://www.icra.org/cgi-bin/HTTP_Headers.cgi?url=http%3A%2F%2Fwww.icra.org%2Flabel%2FHTTP-Test%2Fspace.htm
[2] http://www.icra.org/cgi-bin/HTTP_Headers.cgi?url=http%3A%2F%2Fwww.icra.org%2Flabel%2FHTTP-Test%2Fspace-mod.htm
[3] http://www.icra.org/label/HTTP-Test/

--
Phil Archer
Chief Technical Officer,
Family Online Safety Institute
w. http://www.fosi.org/people/philarcher/

Register now for the annual Family Online Safety Institute Conference and Exhibition, December 11th, 2008, Washington, DC.
See http://www.fosi.org/conference2008/
