Pablo Barbachano wrote:
> Hi, I have a wiki page like this (simplified for the report):
>
> cat >fo.mdwn <<EOF
> 
> EOF
>
> It is valid utf8. When it is converted to html, the 'ó' gets converted to
> ó
>
> I don't know if it is a bug in markdown or ikiwiki.
You can work around this bug by turning off the htmlscrubber module,
either by passing --disable-module htmlscrubber or by removing it from
your ikiwiki setup file.
What seems to be happening is that the input is not being treated as utf-8,
so each of the two bytes of the multi-byte utf-8 character is encoded
separately. Let's see:
[EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn
<p><img src="../images/o.jpg" alt="o" title="ó" />
óóóóó</p>
[EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn | perl -e '
    use HTML::Scrubber;
    my $s=HTML::Scrubber->new(allow => [qw{img}],
        default => [undef, { alt => 1, src => 1, title => 1 }]);
    while (<>) { print $s->scrub($_) }'
<img src="../images/o.jpg" alt="o" title="ó">
óóóóó
[EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn | perl -e '
    use Encode;
    use HTML::Scrubber;
    my $s=HTML::Scrubber->new(allow => [qw{img}],
        default => [undef, { alt => 1, src => 1, title => 1 }]);
    while (<>) { print $s->scrub(Encode::decode_utf8($_)) }'
<img src="../images/o.jpg" alt="o" title="ó">
���
Not sure what happened to the "óóóóó" there, but on the right track..
[EMAIL PROTECTED]:~/src/ikiwiki/doc>markdown < foo.mdwn | perl -CSD -e '
    use HTML::Scrubber;
    my $s=HTML::Scrubber->new(allow => [qw{img}],
        default => [undef, { alt => 1, src => 1, title => 1 }]);
    while (<>) { print $s->scrub($_) }'
<img src="../images/o.jpg" alt="o" title="ó">
óóóóó
So running perl with -CSD, as ikiwiki does, should make it work. But it
doesn't in ikiwiki, so I guess that what we get back from markdown in
ikiwiki is not being treated as utf8 internally before the sanitize hook
is called. I don't understand why though. This was changed in Recai's
big utf-8 patch in ikiwiki 1.5; if I back that patch out things work ok.
Or I could just do this:
Index: IkiWiki/Render.pm
===================================================================
--- IkiWiki/Render.pm	(revision 795)
+++ IkiWiki/Render.pm	(working copy)
@@ -39,9 +39,12 @@
 	}
 
 	if (exists $hooks{sanitize}) {
+		require Encode;
+		$content=Encode::decode_utf8($content);
 		foreach my $id (keys %{$hooks{sanitize}}) {
 			$content=$hooks{sanitize}{$id}{call}->($content);
 		}
+		$content=Encode::encode_utf8($content);
 	}
 
 	return $content;
This patch fixes the problem, but I don't understand why we have to
re-encode the string to utf-8 on the way out. ikiwiki should just use
decoded utf-8 strings internally throughout and let perl automatically
convert to utf-8 on output.
Beginning to think that Recai's patch wasn't the right approach. With my
patch above, if displaying a preview page, ikiwiki will now:
- Read it in from CGI as, apparently, raw utf-8
- decode_utf8 so it's in perl's internal representation
- htmlize it via markdown, which will include running decode_utf8
again on the markdown output as above, and then encode_utf8 so it's
back to raw utf-8
- decode_utf8 once again in the preview code
- finally turn it back into raw utf-8 again and emit it to the
browser
Yugh. This is becoming far too ugly to live. Maybe Recai can help figure
this out..
--
see shy jo