OK, I've done some digging and testing - and my "poking in the dark" hasn't come up with the magic answer (mainly because I'm an amateur code-writer who's a few fathoms out of his depth here).

In HTML::HeadParser I tried adding another regex substitution to the flush_text routine, but it had no effect.

Eventually I went back to the main HTML::Parser and tried changing that. The parse_file routine is the main guts of the module, I believe. I tried stripping out any BOMs found as the chunks of data were read in but, again, no effect.

Incidentally, I found a routine on the W3C site that I was able to quickly adapt to detect/strip BOMs (see below). What I can't see, I'm sorry to say, is where it should go in. Again, a cry for help!

Routine for detecting/stripping BOMs: (see http://dev.w3.org/cvsweb/p3p-validator/20001215/xml.pl?rev=1.5)

sub check_bom {
    my $content = shift;

    # Numeric values of the first four bytes; -1 stands in for missing bytes
    my @bytes = unpack("C4", $content);
    my ($top1, $top2, $top3, $top4) =
        map { defined $bytes[$_] ? $bytes[$_] : -1 } 0 .. 3;

    # The UTF-32 tests must come before the UTF-16 ones: the UTF-32 little
    # endian BOM (FF FE 00 00) starts with the UTF-16 little endian BOM (FF FE).

    # UTF-32 little endian
    if ($top1 == 255 && $top2 == 254 && $top3 == 0 && $top4 == 0) {
        $content = substr($content, 4);
    }
    # UTF-32 big endian (00 00 FE FF)
    elsif ($top1 == 0 && $top2 == 0 && $top3 == 254 && $top4 == 255) {
        $content = substr($content, 4);
    }
    # UTF-8
    elsif ($top1 == 239 && $top2 == 187 && $top3 == 191) {
        $content = substr($content, 3);
    }
    # UTF-16 little endian
    elsif ($top1 == 255 && $top2 == 254) {
        $content = substr($content, 2);
    }
    # UTF-16 big endian
    elsif ($top1 == 254 && $top2 == 255) {
        $content = substr($content, 2);
    }
    return $content;
}
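
One thing that may matter when stripping BOMs "as the chunks of data were read in": a BOM is only meaningful at the very start of the stream, so only the first chunk should be touched. Below is a minimal sketch of that idea. The names strip_leading_bom and feed_chunks are made up for illustration and are not part of HTML::Parser, and the 512-byte chunk size is just an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one leading BOM from a byte string.
# The UTF-32 patterns come first because FF FE 00 00 begins with FF FE.
sub strip_leading_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A(?:\xFF\xFE\x00\x00|\x00\x00\xFE\xFF|\xEF\xBB\xBF|\xFF\xFE|\xFE\xFF)//;
    return $bytes;
}

# Hypothetical driver: read $fh in chunks and pass each one to $handler,
# stripping a BOM from the first chunk only.
sub feed_chunks {
    my ($fh, $handler) = @_;
    binmode $fh;
    my $first = 1;
    while (read($fh, my $chunk, 512)) {
        if ($first) {
            $chunk = strip_leading_bom($chunk);
            $first = 0;
        }
        $handler->($chunk);
    }
    return;
}
```

With a real parser object, the handler would presumably be something like sub { $p->parse($_[0]) }, followed by $p->eof once the loop is done.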

Phil.

----- Original Message -----
From: "Gisle Aas" <[EMAIL PROTECTED]>
To: "Phil Archer" <[EMAIL PROTECTED]>
Cc: "libwww list" <[EMAIL PROTECTED]>
Sent: Thursday, October 07, 2004 10:26 AM
Subject: Re: Byte Order Mark mucks up headers



"Phil Archer" <[EMAIL PROTECTED]> writes:

I've read Sean Burke's book, I've looked through the archives of this
list and done other searches but can't find an answer to a problem I
have found with LWP. If the character coding for a website has a byte
order mark (things like utf-16, all that "big endian/little endian"
stuff) then LWP can't interpret HTML headers in the usual way. Does
anyone know a way around this?

HTML::HeadParser needs to be fixed. It will assume that there is no <head> section when it sees text before anything else. The part of the code responsible for this currently allows whitespace, but needs to be taught that a BOM is harmless too. Look at the 'text' method.
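
A rough sketch of the kind of test this describes, written as a standalone function rather than as the module's actual code: the name is_ignorable_leading_text is hypothetical, and it assumes the text arrives either as raw UTF-8 bytes or as already-decoded characters. It treats text as harmless when it is only whitespace, optionally preceded by a BOM.

```perl
use strict;
use warnings;

# Hypothetical check, not the actual HTML::HeadParser code: true when $text
# is nothing but whitespace, optionally preceded by a BOM.
sub is_ignorable_leading_text {
    my ($text) = @_;
    # UTF-8 BOM as raw bytes, or U+FEFF if the text was already decoded
    $text =~ s/\A(?:\xEF\xBB\xBF|\x{FEFF})//;
    return $text =~ /\A\s*\z/ ? 1 : 0;
}
```

A UTF-16 or UTF-32 document would presumably need decoding before it reaches the parser at all, so this sketch only covers the UTF-8 case.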

Do you want to try to provide a patch?

Regards,
Gisle



