Hi Nick, Bob, Henning & Ingo,
(Not sure which of you is best to talk to; Henning was in the sigline @
www.openbsd.org/papers/, Bob tends to be all over the WWW schtuff,
Nick tends to wear the documentation hat, and this chiefly concerns
Ingo's presentation.
I've bcc'd you all to avoid swamping your mailboxen by default with
misc@ reply-to-alls.
I'm copying misc@ in on this however, to avoid any behind-the-back
hard feelings.)

There was a recent misc thread where andres.p complained thusly:

>>  http://www.openbsd.org/papers/bsdcan11-mandoc-openbsd.html
> that page is encoded iso 8859-1, doesn't state so anywhere, breaks
> with browsers configured to default to utf8 in the absence of encoding
> qualifiers.
cf. <http://marc.info/?l=openbsd-misc&m=134083965227817&w=2>

I sent a diff adding a meta tag with a charset=iso-8859-1 Content-Type
parameter, and people had all kinds of responses, mostly suggesting
that there was a much, much bigger problem than merely minor mojibake
gobbledygook in Ingo's presentation.

So I've now just gone through ALL the presentations on
http://www.openbsd.org/papers/ , and I've determined that the problem
is much, much smaller than it's cracked up to be in the misc thread.
This diff fixes things:

--- bsdcan11-mandoc-openbsd.html        2012-06-30 22:18:52.000000000 +0200
+++ bsdcan11-mandoc-openbsd.html.newentities    2012-06-30 22:34:58.000000000 +0200
@@ -13,7 +13,7 @@

 <p><a href="http://www.flickr.com/photos/tomkoadam/4778126822/"><img
 src="http://farm5.static.flickr.com/4115/4778126822_555b453a1e.jpg"></a></p>
-<p>Csikó - Foal. - Photo: Adam Tomkó @flickr (CC)</p>
+<p>Csik&oacute; - Foal. - Photo: Adam Tomk&oacute; @flickr (CC)</p>

 <HR>
 <P>Ingo Schwarze: Mandoc in OpenBSD - page 2: INTRO I -
@@ -725,7 +725,7 @@
 <HR>
 <P>Ingo Schwarze: Mandoc in OpenBSD - page 22: RECURRING II -
 BSDCan 2011, May 13, Ottawa</P>
-<H1>Bogue déjà vue:</H1>
+<H1>Bogue d&eacute;j&agrave; vue:</H1>
 <H2>Collecting regression tests.</H2>
 <UL>
 <LI>Slow start in 2009:

That's it. That's all.

If however that patch doesn't apply (because of mojibake), or if you
want to hear the whole long yakety-yak, then you'll probably like to
continue reading the long version of things, which commences here:

---LOW SIGNAL-TO-NOISE RATIO PAST THIS POINT. PROCEED AT OWN RISK.---

So again, the complaint was that there was mojibake gibberish in
Ingo's presentation, because the character encoding isn't specified
but defaults to UTF-8 in modern browsers, while the page is actually
iso-8859-1 encoded.
There were many objections to simply adding <HEAD><META
http-equiv="Content-Type" content="text/html; charset=iso-8859-1"
/></HEAD> as a fix. Let me just tackle those one by one.

* Andres.p said we ought to use m4 or another macro language to make
this change apply to all pages, and use commit hooks to ensure it's
applied in the future.
--> Upon inspection, the only page that I could find that has a
mojibake problem is Ingo's slides page.
http://www.openbsd.org/index.html and most other openbsd.org pages
seem to *have* HTML headers with iso-8859-1 specified in them. CVSweb
and man.cgi don't, but I haven't found any mojibake there.

* Marc said to adapt ports-readmes.
--> Maybe so, but... complex endeavours, who's gonna do it, and in the
meantime, why not just fix things with a simple patch?

* Stuart was concerned that not changing ALL the pages was "likely to
be way too much pain for the translators."
--> Actually, all the translations just link to
www.openbsd.org/papers/ -- no i18n issue there.

* Tedu said: "Consolidating all that content into a consistent style,
any style, would be great."
--> Well, here's what's actually on /papers/:
 * magicpoint slides -- the mojibake issue doesn't arise (because
jpegs) <http://i.eho.st/ppwmqr1u.png>
   example: http://www.openbsd.org/papers/asiabsdcon2010_vether/index.html
 * ps/pdf files -- the issue doesn't arise
   examples: http://www.openbsd.org/papers/strlcpy-paper.ps ,
http://www.openbsd.org/papers/strlcpy-paper.pdf
 * s5 <http://meyerweb.com/eric/tools/s5/s5-intro.html> -- the few
presentations that are there seem fine, encoding-wise
   example: http://www.openbsd.org/papers/eurobsdcon07/pyr-loadbalancing/
 * kpresent slides -- the few presentations that are there seem fine,
encoding-wise
   example: http://www.openbsd.org/papers/nycbsdcon06_sparc64/
 * w3.org Slidy -- the few presentations that are there seem fine,
encoding-wise
   example:
http://www.openbsd.org/papers/asiabsdcon2010_epitome2/epitome2.html
 * Ian's doc2html script -- seems to be fine, encoding-wise
   see: http://cvs.openbsd.org/papers/oreilly2000/index.html
 * mp4 video -- the issue doesn't arise
   see: http://talks.dixongroup.net/nycbsdcon2008/
 * no idea what this was generated with, but it's grand:
   http://www.openbsd.org/papers/bsdcan06-wlan/index.html

* Dave said meta tags were ugly
--> but frantisek was correct to
observe that AddDefaultCharset is in fact somewhat braindead, because
once the HTTPD server sends a Content-Type HTTP header, that *always*
overrides any Content-Type meta tag parameters. AddDefaultCharset
should really be called AddCharsetOverride, because that's what it
does. Worse, this overriding behaviour has been made official as
per RFC (and I'm reminded here of people who object to changing bad
laws that outlaw X on grounds that "X is illegal"). It's so stupid,
and I don't think Dave even grokked frantisek's point about that. (And
yes, it sucks that browsers aren't clairvoyant about what charset is
coming down the line, but that's why we have defaults and universal
tags. And yes, browsers (partially) rendering a page twice is sucky,
but that's a client-side browser issue, and browsers can hide any
re-parsing from the user, and this will be local anyway, as no sane
browser would re-request the file from remote.)
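
The precedence described here can be modelled in a few lines. This is a deliberately simplified sketch, not what any particular browser implements: the function name effective_charset, the 1024-byte scan window, and the "utf-8" fallback are assumptions made for illustration, and real browsers also look at byte order marks and other hints.

```python
import re

# Hypothetical helper patterns: pull a charset out of a header value or a <meta> tag.
HEADER_RE = re.compile(r'charset\s*=\s*"?([\w-]+)', re.I)
META_RE = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', re.I)


def effective_charset(content_type_header, body, default="utf-8"):
    """Return the charset chosen under this simplified precedence model."""
    # 1. A charset in the HTTP Content-Type header wins outright.
    if content_type_header:
        m = HEADER_RE.search(content_type_header)
        if m:
            return m.group(1).lower()
    # 2. Otherwise use a <meta> declaration near the top of the page.
    m = META_RE.search(body[:1024])
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Otherwise the browser's configured default applies.
    return default


page = (b'<head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1"></head>')

print(effective_charset("text/html; charset=UTF-8", page))  # header overrides meta
print(effective_charset("text/html", page))                 # meta is used
print(effective_charset(None, b"<p>no declaration</p>"))    # default is used
```

The first call is the AddDefaultCharset situation: the page declares iso-8859-1, but the header's value is what gets used.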

* Peter Laufenberg suggested Lua, Tim Howe suggested Perl and its
template toolkit
--> That's all nice and well if you're willing to do the work, but
it's not needed to fix this very minor problem.

So anyway, having read and considered all that and having done the
research, I then initially thought okay, if people don't like meta tag
Content-Type charset parameters, we could just convert the page to
UTF-8 to match modern defaults.
Famous last words.
This proved to be a bit of a bummer, chiefly because diffs
incorporating special characters can be tricky: The shell and MTAs
along the way and browsers that people may be accessing their
{g|e}mail from all need to play along, and it's especially difficult,
because to deal with UTF-8 bytes as Latin-1 character bytes, you have
to transmit a no-break space (byte 0xA0, the second byte of UTF-8
"à"), and MTAs may, and most browsers almost certainly will, replace
no-break spaces with plain spaces.
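
The bytes involved can be inspected directly. The following is a sketch that lists, for the three distinct accented letters in the slides, the UTF-8 bytes and what a Latin-1 reader would take them to be; the helper name misread_as_latin1 is made up for this example. For these letters no 0x00 byte shows up; the fragile one is 0xA0, which Latin-1 reads as a no-break space.

```python
import unicodedata


def misread_as_latin1(ch):
    """Return (hex of UTF-8 bytes, names of the characters Latin-1 sees)."""
    utf8 = ch.encode("utf-8")
    seen = utf8.decode("iso-8859-1")
    return utf8.hex(), [unicodedata.name(c) for c in seen]


for ch in "óéà":
    hex_bytes, names = misread_as_latin1(ch)
    print(ascii(ch), hex_bytes, names)
```

The second byte of "à" is the one that a mail client or browser is likely to normalise to an ordinary space, after which the patch no longer matches the file.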
Not wanting to give up easily, I then thought, ok, that's what base64
was invented for, and I did this:
 $ diff -u bsdcan11-mandoc-openbsd.html \
     bsdcan11-mandoc-openbsd.html.newUTF-8 | base64 - > base64patch
Then I thought I'd just email the base64patch, and people on the
receiving end could do:
 $ base64 -d base64patch > clearpatch
And then they could apply the clearpatch.
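
Why base64 helps can be shown with a small round trip. This is a sketch using Python's base64 module rather than the command-line tool, and the patch bytes below are a hypothetical two-line stand-in for the real diff (one line in ISO-8859-1, one in UTF-8).

```python
import base64

# Stand-in for the real diff: old line in ISO-8859-1, new line in UTF-8.
patch = (
    b"-<H1>Bogue d\xe9j\xe0 vue:</H1>\n"          # ISO-8859-1 bytes
    b"+<H1>Bogue d\xc3\xa9j\xc3\xa0 vue:</H1>\n"  # UTF-8 bytes
)

encoded = base64.b64encode(patch)    # plays the role of base64patch
decoded = base64.b64decode(encoded)  # plays the role of clearpatch

print(encoded.isascii())  # only plain ASCII goes over the wire
print(decoded == patch)   # the original bytes come back unchanged
```

Because the encoded form contains nothing but ASCII letters, digits, and a few punctuation characters, there is nothing in it for an MTA or webmail client to re-encode or normalise.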
But then I thought, what about browsers that don't support UTF-8 yet;
this is going to break things for them. And then I had a brain wave --
why, just use named entities, and suddenly UTF-8 or ISO-8859-1 or
ISO-8859-15 or Windows, or Western, or whatever encoding is used won't
make a difference anymore, because things are all the same there with
named entities in HTML. Yay named entities in HTML:
<http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML>
And that's what my above diff is: It just turns the four accented
characters in Ingo's presentation (and that's all there is) into named
entities. Of course, the diff itself contains accented characters in
the -minus lines, but if these have turned to mojibake at your end,
just type the two &oacute; and the &eacute; and &agrave; entities in
manually. It's not even worth using base64 for. And the best thing:
The named entities will keep working even if you decide to specify a
charset in the future, whether with meta tags or HTTP headers.
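
The encoding-independence of named entities can be checked with a short sketch. It assumes Python's html.unescape as a stand-in for a browser's entity handling, and the list of charsets is just a sample of the ones mentioned above.

```python
import html

# The patched heading: pure ASCII, with the accents spelled as named entities.
line = "<H1>Bogue d&eacute;j&agrave; vue:</H1>"
raw = line.encode("ascii")

# Decoding the same bytes under several charsets gives identical text,
# because all of them agree on the ASCII range.
charsets = ("utf-8", "iso-8859-1", "iso-8859-15", "cp1252")
decodings = {cs: raw.decode(cs) for cs in charsets}
print(len(set(decodings.values())))  # number of distinct results

# Resolving the entities then yields the intended accented characters.
print(ascii(html.unescape(line)))
```

Whatever charset a header or meta tag later declares, the bytes on disk stay valid, which is why the entity version is safe to commit before any decision about site-wide encoding is made.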

And yes, sure, if anybody actually *wants* to write a Perl template
toolkit thing or wants to convert all the presentation slides to The
One True Format To End All Formats™, maybe using The One True Text
Editor, well, it's not up to me to tell you what you should or
shouldn't do. But I do think these 4 (four) named entities pretty much
solve the actual issue.

Thanks for your attention,
Ian

PS:
The other complaint andres.p mentioned in his earlier email was:
> concretely, the man and webcvs pages do not have links back to openbsd.org
>
> good design would be to make the openbsd logo at the top left corner be the
> link
I'm inclined to agree, but I can't find the actual CVSweb script and
wosch@FBSD's man.cgi script. Bob?

PPS:
On the Tokyo PC Users Group's presentation, one *could*, to make
things easier to find, add a direct link to the offsite magic point
slides: http://www.openbsd-support.com/jp/en/htm/mgp/tokyopc05/index.html
Or, given permission, copy them to openbsd.org. But this is nitpicking.
