Suggested resolution (was Re: Byte Order Mark mucks up headers)

Phil Archer Sun, 24 Oct 2004 10:43:39 -0700

Dear all,

A couple of weeks ago I raised an issue about Byte Order Marks effectively disabling the header parsing functions. Thanks again to those who took the trouble to reply. After a bit of poking around in the dark I've fixed it for my own needs - I leave it to others to judge whether this is robust enough for general usage, especially since I have only carried out cursory testing.

Within the HTML Head Parser there is a routine called text that looks like this:

sub text
{
   my($self, $text) = @_;
   print "TEXT[$text]\n" if $DEBUG;
   my $tag = $self->{tag};
   if (!$tag && $text =~ /\S/) {
       # Normal text means start of body
       $self->eof;
       return;
   }
   return if $tag ne 'title';
   $self->{'text'} .= $text;
}

This is where the byte order mark is detected and the process stops since it's text outside any tag. So I've just added an extra term to the if statement thus:

sub text
{
   my($self, $text) = @_;
   print "TEXT[$text]\n" if $DEBUG;
   my $tag = $self->{tag};
   if (!$tag && $text =~ /\S/ && !BOM($text)) {
       # Normal text means start of body
       $self->eof;
       return;
   }
   return if $tag ne 'title';
   $self->{'text'} .= $text;
}
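As a quick illustration of why the unmodified condition fires on a BOM, here is a minimal sketch. The helper looks_like_body_text and the sample page are made up for this example and are not part of the Head Parser; the sketch only assumes that the bytes in front of the first tag reach text() as ordinary text.

```perl
use strict;
use warnings;

# Raw bytes of a UTF-8 byte order mark, followed by the start of a page.
my $bom  = "\xEF\xBB\xBF";
my $page = $bom . "<html><head><title>Example</title></head></html>";

# Hypothetical helper: mirrors the /\S/ part of the test in sub text,
# i.e. "does this chunk contain anything other than whitespace?"
sub looks_like_body_text {
    my ($text) = @_;
    return $text =~ /\S/ ? 1 : 0;
}

# Assumption: whatever precedes the first "<" is handed over as text.
my ($leading) = $page =~ /^([^<]*)/;

print "leading bytes: ",
    join(" ", map { sprintf "%02X", ord($_) } split //, $leading), "\n";
print "BOM counts as non-whitespace:  ", looks_like_body_text($leading), "\n";
print "blank lines count as such too: ", looks_like_body_text("  \n"), "\n";
```

None of the three BOM bytes is whitespace, so the /\S/ test succeeds and the parser concludes that the body has started.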

I also defined a little routine, BOM, thus:

sub BOM {
    my $text = shift;

    # First four bytes as numbers, padded with -1 so that strings
    # shorter than four bytes still compare safely.
    my ($top1, $top2, $top3, $top4) = (unpack("C4", $text), -1, -1, -1, -1);

    # UTF-8 (EF BB BF)
    if ($top1 == 239 && $top2 == 187 && $top3 == 191) {
        return 'UTF-8';
    }

    # UTF-32 little endian (FF FE 00 00); checked before UTF-16
    # little endian, whose BOM is a prefix of this one
    if ($top1 == 255 && $top2 == 254 && $top3 == 0 && $top4 == 0) {
        return 'UTF-32 little endian';
    }

    # UTF-32 big endian (00 00 FE FF)
    if ($top1 == 0 && $top2 == 0 && $top3 == 254 && $top4 == 255) {
        return 'UTF-32 big endian';
    }

    # UTF-16 little endian (FF FE)
    if ($top1 == 255 && $top2 == 254) {
        return 'UTF-16 little endian';
    }

    # UTF-16 big endian (FE FF)
    if ($top1 == 254 && $top2 == 255) {
        return 'UTF-16 big endian';
    }

    return 0;
}

This is an adaptation of a routine found at http://dev.w3.org/cvsweb/p3p-validator/20001215/xml.pl?rev=1.5.

I have not been able to test this on any BOMs other than UTF-8. If you use a BOM other than that, I'd be very pleased to hear of it.
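For anyone who wants to try the other encodings without hunting for a live site, the following is a rough sketch of a table-driven check against hand-built byte strings. detect_bom is a hypothetical stand-in written for this example, not the routine above, and the samples are simply each BOM followed by "<" in the matching encoding. The signatures are ordered longest first because the UTF-16 little endian BOM (FF FE) is a prefix of the UTF-32 little endian one (FF FE 00 00).

```perl
use strict;
use warnings;

# Known BOM signatures, longest first so that UTF-32 is tried before UTF-16.
my @SIGNATURES = (
    [ "\x00\x00\xFE\xFF", 'UTF-32 big endian'    ],
    [ "\xFF\xFE\x00\x00", 'UTF-32 little endian' ],
    [ "\xEF\xBB\xBF",     'UTF-8'                ],
    [ "\xFE\xFF",         'UTF-16 big endian'    ],
    [ "\xFF\xFE",         'UTF-16 little endian' ],
);

# Hypothetical stand-in for a BOM check: returns the encoding name if
# $bytes starts with a known BOM, and 0 otherwise.
sub detect_bom {
    my ($bytes) = @_;
    for my $sig (@SIGNATURES) {
        my ($prefix, $name) = @$sig;
        return $name if index($bytes, $prefix) == 0;
    }
    return 0;
}

# Hand-built samples: a BOM followed by "<" encoded to match.
my %samples = (
    'UTF-8'                => "\xEF\xBB\xBF<",
    'UTF-16 little endian' => "\xFF\xFE<\x00",
    'UTF-16 big endian'    => "\xFE\xFF\x00<",
    'UTF-32 little endian' => "\xFF\xFE\x00\x00<\x00\x00\x00",
    'UTF-32 big endian'    => "\x00\x00\xFE\xFF\x00\x00\x00<",
    'no BOM'               => "<html>",
);

for my $label (sort keys %samples) {
    my $found = detect_bom($samples{$label});
    printf "%-22s => %s\n", $label, $found || 'none';
}
```

One caveat: FF FE 00 00 could in principle also be a UTF-16 little endian BOM followed by a NUL character, so the sketch simply assumes the longer match is the intended one.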

The changes are in place on the ICRA label tester: www.icra.org/label/tester/ (this looks for PICS labels in the headers, hence the importance of this bit of LWP for me!). In the original e-mail for this thread I gave the following two examples, both of which now work correctly:

An example of a site with a BOM that previously showed as having no headers but is now OK: http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.xtranslations.com&showHead=on&showContent=on

An example of a site with a label without a BOM (that still works as it should!) http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.yahoo.com&showHead=on&showContent=on


Phil Archer
Chief Technical Officer
Internet Content Rating Association
Label your site today at http://www.icra.org

----- Original Message -----
From: "Phil Archer" <[EMAIL PROTECTED]>
To: "libwww list" <[EMAIL PROTECTED]>
Sent: Thursday, October 07, 2004 10:11 AM
Subject: Byte Order Mark mucks up headers

Hi,
I've read Sean Burke's book, I've looked through the archives of this list and done other searches but can't find an answer to a problem I have found with LWP. If the character coding for a website has a byte order mark (things like utf-16, all that "big endian/little endian" stuff) then LWP can't interpret HTML headers in the usual way. Does anyone know a way around this?
Background:
I work for an organisation called ICRA. We provide a self-labelling and filtering system for the web, currently based on the old PICS standard but soon to move to RDF. A couple of years ago I built a tool for our website that visits a site, checks for PICS labels and parses them if found. Now, I can strip out the BOM from the content where found and do other clunky processing but that would mean I can't use LWP's efficient header commands. For sites without a BOM I can just get header->('Pics-label') and process that.
You can see the label tester at www.icra.org/label/tester/
An example of a site with a BOM that shows as unlabelled even though it is: http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.xtranslations.com&showHead=on&showContent=on

An example of a site with a label without a BOM (i.e one that works as it should) would be http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.yahoo.com&showHead=on&showContent=on
Any help gratefully accepted.
Phil.
Phil Archer
Chief Technical Officer
Internet Content Rating Association
Label your site today at http://www.icra.org
