Hi Nick, Bob, Henning & Ingo,

(Not sure which of you is best to talk to: Henning was in the sigline @ www.openbsd.org/papers/, Bob tends to be all over the WWW schtuff, Nick tends to wear the documentation hat, and this chiefly concerns Ingo's presentation. I've bcc'd you all to avoid swamping your mailboxen by default with misc@ reply-to-alls. I'm copying misc@ in on this, however, to avoid any behind-the-back hard feelings.)

There was a recent misc thread where andres.p complained thusly:

>> http://www.openbsd.org/papers/bsdcan11-mandoc-openbsd.html
> that page is encoded iso 8859-1, doesn't state so anywhere, breaks
> with browsers configured to default to utf8 in the absence of encoding
> qualifiers.

cf. <http://marc.info/?l=openbsd-misc&m=134083965227817&w=2>

I sent a diff adding a Content-Type meta tag with a charset=iso-8859-1 parameter, and people had all kinds of responses, mostly suggesting that there was a much, much bigger problem than merely minor mojibake gobbledygook in Ingo's presentation.

So I've now just gone through ALL the presentations on http://www.openbsd.org/papers/ , and I've determined that the problem is much, much smaller than it's cracked up to be in the misc thread. This diff fixes things:

--- bsdcan11-mandoc-openbsd.html 2012-06-30 22:18:52.000000000 +0200
+++ bsdcan11-mandoc-openbsd.html.newentities 2012-06-30 22:34:58.000000000 +0200
@@ -13,7 +13,7 @@
 <p><a href="http://www.flickr.com/photos/tomkoadam/4778126822/"><img src="http://farm5.static.flickr.com/4115/4778126822_555b453a1e.jpg"></a></p>
-<p>Csikó - Foal. - Photo: Adam Tomkó @flickr (CC)</p>
+<p>Csik&oacute; - Foal. - Photo: Adam Tomk&oacute; @flickr (CC)</p>
 <HR>
 <P>Ingo Schwarze: Mandoc in OpenBSD - page 2: INTRO I -
@@ -725,7 +725,7 @@
 <HR>
 <P>Ingo Schwarze: Mandoc in OpenBSD - page 22: RECURRING II -
 BSDCan 2011, May 13, Ottawa</P>
-<H1>Bogue déjà vue:</H1>
+<H1>Bogue d&eacute;j&agrave; vue:</H1>
 <H2>Collecting regression tests.</H2>
 <UL>
 <LI>Slow start in 2009:

That's it. That's all.

If however that patch doesn't apply (because of mojibake), or if you want to hear the whole long yakety-yak, then you'll probably want to continue reading the long version of things, which commences here:

---LOW SIGNAL-TO-NOISE RATIO PAST THIS POINT.
PROCEED AT OWN RISK.---

So again, the complaint was that there was mojibake gibberish in Ingo's presentation, because the character encoding isn't specified but defaults to UTF-8 in modern browsers, while the page is actually iso-8859-1 encoded. There were many objections to a simple addition of

<HEAD><META http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /></HEAD>

as a fix. Let me just tackle those one by one.

* Andres.p said we ought to use m4 or another macro language to make this change apply to all pages, and use commit hooks to ensure it's applied in the future.
  --> Upon inspection, the only page that I could find that has a mojibake problem is Ingo's slides page. http://www.openbsd.org/index.html and most other openbsd.org pages seem to *have* HTML headers with iso-8859-1 specified in them. CVSweb and man.cgi don't, but I haven't found any mojibake there.

* Marc said to adapt ports-readmes.
  --> Maybe so, but... complex endeavours, who's gonna do it, and in the meantime, why not just fix things with a simple patch?

* Stuart was concerned that not changing ALL the pages was "likely to be way too much pain for the translators."
  --> Actually, all the translations just link to www.openbsd.org/papers/ -- no i18n issue there.

* Tedu said: "Consolidating all that content into a consistent style, any style, would be great."
  --> Well, here's what's actually on /papers/:

    * magicpoint slides -- the mojibake issue doesn't arise (because jpegs) <http://i.eho.st/ppwmqr1u.png>
      example: http://www.openbsd.org/papers/asiabsdcon2010_vether/index.html
    * ps/pdf files -- the issue doesn't arise
      examples: http://www.openbsd.org/papers/strlcpy-paper.ps , http://www.openbsd.org/papers/strlcpy-paper.pdf
    * s5 <http://meyerweb.com/eric/tools/s5/s5-intro.html> -- the few presentations that are there seem fine, encoding-wise
      example: http://www.openbsd.org/papers/eurobsdcon07/pyr-loadbalancing/
    * kpresent slides -- the few presentations that are there seem fine, encoding-wise
      example: http://www.openbsd.org/papers/nycbsdcon06_sparc64/
    * w3.org Slidy -- the few presentations that are there seem fine, encoding-wise
      example: http://www.openbsd.org/papers/asiabsdcon2010_epitome2/epitome2.html
    * Ian's doc2html script -- seems to be fine, encoding-wise
      see: http://cvs.openbsd.org/papers/oreilly2000/index.html
    * mp4 video -- the issue doesn't arise
      see: http://talks.dixongroup.net/nycbsdcon2008/
    * no idea what this was generated with, but it's grand:
      http://www.openbsd.org/papers/bsdcan06-wlan/index.html

* Dave said meta tags were ugly
  --> but frantisek was correct to observe that AddDefaultCharset is in fact somewhat braindead, because once the HTTPD server sends a Content-Type HTTP header, that *always* overrides any Content-Type meta tag parameters. AddDefaultCharset should really be called AddCharsetOverride, because that's what it does. Worse, this overriding behaviour has been made official as per RFC (and I'm reminded here of people who object to changing bad laws that outlaw X on grounds that "X is illegal"). It's so stupid, and I don't think Dave even grokked frantisek's point about that.

  (And yes, it sucks that browsers aren't clairvoyant about what charset is coming down the line, but that's why we have defaults and universal tags.
  And yes, browsers (partially) rendering a page twice is sucky, but that's a client-side browser issue, and browsers can hide any re-parsing from the user, and this will be local anyway, as no sane browser would re-request the file from remote.)

* Peter Laufenberg suggested Lua, Tim Howe suggested Perl and its template toolkit
  --> That's all nice and well if you're willing to do the work, but it's not needed to fix this very minor problem.

So anyway, having read and considered all that and having done the research, I then initially thought okay, if people don't like meta tag Content-Type charset parameters, we could just convert the page to UTF-8 to match modern defaults.

Famous last words. This proved to be a bit of a bummer, chiefly because diffs incorporating special characters can be tricky: The shell and MTAs along the way and browsers that people may be accessing their {g|e}mail from all need to play along, and it's especially difficult, because to deal with UTF-8 bytes as Latin-1 character bytes, you have to transmit a NUL character, and MTAs may, and most browsers almost certainly will, replace NUL characters with spaces.

Not wanting to give up easily, I then thought, ok, that's what base64 was invented for, and I did this:

$ diff -u bsdcan11-mandoc-openbsd.html bsdcan11-mandoc-openbsd.html.newUTF-8 | base64 - > base64patch

Then I thought I'd just email the base64patch, and people on the receiving end could do:

$ base64 -d base64patch > clearpatch

And then they could apply the clearpatch.

But then I thought, what about browsers that don't support UTF-8 yet? This is going to break things for them.

And then I had a brain wave -- why, just use named entities, and suddenly UTF-8 or ISO-8859-1 or ISO-8859-15 or Windows, or Western, or whatever encoding is used won't make a difference anymore, because things are all the same there with named entities in HTML.

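The named-entity idea can be sketched in a few lines of Python. This is only an illustration under assumptions: Python was not used anywhere in the thread, the helper name to_named_entities is hypothetical, and the fallback to numeric references for characters that have no HTML 4 named entity is an assumed choice. The lookup table is the standard library's html.entities.codepoint2name.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Return text with every non-ASCII character replaced by an HTML entity.

    Hypothetical helper: uses a named entity where the HTML 4 table has one,
    and a numeric character reference otherwise (assumed fallback).
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                          # plain ASCII passes through
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])  # e.g. 0xE9 -> &eacute;
        else:
            out.append("&#%d;" % cp)                 # no named entity available
    return "".join(out)


# The two changed lines from the slides, written with escapes so this
# example itself survives any transport: \u00f3 = o-acute, \u00e9 = e-acute,
# \u00e0 = a-grave.
caption = "<p>Csik\u00f3 - Foal. - Photo: Adam Tomk\u00f3 @flickr (CC)</p>"
heading = "<H1>Bogue d\u00e9j\u00e0 vue:</H1>"

for line in (caption, heading):
    converted = to_named_entities(line)
    print(converted)
    # Pure ASCII now, so the bytes are the same in either encoding.
    print(converted.encode("iso-8859-1") == converted.encode("utf-8"))
```

Since the converted text contains only ASCII, it is byte-for-byte identical whether it is saved as ISO-8859-1 or UTF-8, which is why the browser's guess about the charset stops mattering for these characters.
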
Yay named entities in HTML: <http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML>

And that's what my above diff is: It just turns the four accented characters in Ingo's presentation (and that's all there is) into named entities. Of course, the diff itself contains accented characters in the -minus lines, but if these have turned to mojibake at your end, just type the two &oacute; and the &eacute; and &agrave; entities in manually. It's not even worth using base64 for.

And the best thing: The named entities will keep working even if you decide to specify a charset in the future, whether with meta tags or HTTP headers.

And yes, sure, if anybody actually *wants* to write a Perl template toolkit thing or wants to convert all the presentation slides to The One True Format To End All Formats™, maybe using The One True Text Editor, well, it's not up to me to tell you what you should or shouldn't do. But I do think these 4 (four) named entities pretty much solve the actual issue.

Thanks for your attention,

Ian

PS: The other complaint andres.p mentioned in his earlier email was:

> concretely, the man and webcvs pages do not have links back to openbsd.org
>
> good design would be to make the openbsd logo at the top left corner be the link

I'm inclined to agree, but I can't find the actual CVSweb script and wosch@FBSD's man.cgi script. Bob?

PPS: On the Tokyo PC Users Group's presentation, one *could*, to make things easier to find, add a direct link to the offsite MagicPoint slides: http://www.openbsd-support.com/jp/en/htm/mgp/tokyopc05/index.html
Or, given permission, copy them to openbsd.org. But this is nitpicking.

