utf-8 woes

Pete Phillips Wed, 27 Feb 2008 11:18:35 -0800

Hi

I use xpathscript to process docbook xml articles for our woundcare
journal World Wide Wounds (www.worldwidewounds.com).


In recent years we have had lots of articles referencing articles in
Scandinavia, hence a requirement to use &ouml; and other entities in
authors' names.

Most of the time the processing works OK (they are usually found in the
citations), but we have an issue at the moment where

       &ouml; 

in the body of the text results in the output of a two-byte character,
and yet in a different part of the document (the bibliography) the
conversion works correctly and the 'o' with two little dots above it
comes out OK when viewed in Firefox using ISO-8859-1 encoding.
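
For illustration only (this does not diagnose where the stylesheet goes
wrong), the two-byte symptom matches what happens when U+00F6 is written
as UTF-8 rather than ISO-8859-1. The helper name hex_bytes below is
hypothetical and not part of the stylesheet; Encode is a core Perl module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS), the character behind &ouml;
my $char = "\x{f6}";

# Hypothetical helper: return the bytes of $string in $encoding
# as space-separated upper-case hex.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

print "UTF-8:      ", hex_bytes('UTF-8',      $char), "\n";   # C3 B6
print "ISO-8859-1: ", hex_bytes('ISO-8859-1', $char), "\n";   # F6
```

A browser told to read the page as ISO-8859-1 shows the UTF-8 bytes
C3 B6 as two separate characters, whereas the single byte F6 shows as
the expected o with two dots.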

Clearly this is a problem in the way I am handling the incoming xml.

An example of what doesn't work is text in a sidebar, where my code is:

------------------------------------------------------------------------
# Sidebar
$t->{'sidebar'}{testcode} = sub {
  my ($node, $t) = @_;
  my ($fileref);
  my $id;
  if ($id = findvalue('@id', $node)) {
    $t->{pre} .= "<a name=\"$id\"></a>";
  }
  # When we find a sidebar,
  # put link to it
  $t->{pre} .= '<DIV CLASS="sidebar">';
  $t->{post} = '</DIV>';
  return 1;
};

$t->{'para'}{testcode} = sub {
  my ($node, $t2) = @_;
  my $id;
  my $anchor = "";   # stays empty when the para has no id attribute
  if ($id = findvalue('@id', $node)) {
    $anchor = $id;
  }
  # we want to get rid of para breaks directly after
  # we start a glossary definition, so check for
  # $removepara > 0
  if ($removepara > 0) {
    $t2->{pre} .= "";
  } else {
    if ($anchor ne "") {
      $t2->{pre} .= "<p><a name=\"$anchor\"></a>";
    } else {
      $t2->{pre} .= "<p>";
    }
    #    $t2->{post} = "</p>"; # this really should work, but it messes up NS4.x
  }
  #$removepara=0;
  $removepara--;
  return 1;
};
------------------------------------------------------------------------

If I use:

 <sidebar><para>This article was sponsored by an educational grant from
 L&#x00f6;nd Corp</para></sidebar>

the "Lönd Corp" comes out with a two-byte character in it.

Is there some easy hack I am missing that would automatically
convert the text to the appropriate encoding?  I suspect it is
something I should be doing in the 'para' subroutines?
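
One encoding-independent possibility, sketched here under assumptions
rather than tested against the real stylesheet: replace every non-ASCII
character with a numeric character reference before it is written out,
so the browser's charset setting no longer matters. The function name
to_numeric_refs is hypothetical, and the sketch assumes the text arrives
as a decoded Perl character string; if it arrives as raw UTF-8 bytes it
would need Encode::decode first.

```perl
use strict;
use warnings;

# Hypothetical helper: turn each character above U+007F into a decimal
# HTML character reference such as &#246;. ASCII text passes through
# unchanged. Assumes $text is a decoded character string, not raw bytes.
sub to_numeric_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

print to_numeric_refs("L\x{f6}nd Corp"), "\n";   # L&#246;nd Corp
```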

I'm happy to send the three files (main file plus two library files)
which do the conversion, but in total they come to about 30 pages of
code.

Any help much appreciated.

Regards,
Pete
--
Pete Phillips, Acting Director,     |   http://www.smtl.co.uk/
Surgical Materials Testing Lab,     |   http://www.worldwidewounds.com/
Princess of Wales Hospital, S Wales |   http://www.dressings.org/
Tel/Fax: +44 1656-752820/30         |   [EMAIL PROTECTED]
