I have had a functional gettext-based internationalized content management system for a while now. A number of translators have offered their support, and I have localization files for Swedish, Norwegian, Chinese, Arabic, Turkish, Japanese, Spanish, etc.

The PHP software system is utf-8 based, so character sets haven't been an issue. Indeed, everything's been working quite well, but I just noticed a procedural item that made me wonder what the best approach is.

When translators of non-Roman-script languages (Japanese, Arabic, Chinese) send me their messages.po files, I open and save them as "utf-8 (no BOM)" files to preserve their integrity. (I use BBEdit on Mac OS X, which handles this nicely.)
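For a batch of incoming files, the "no BOM" step could also be checked in PHP rather than by hand. This is only a minimal sketch: strip_utf8_bom() is a hypothetical helper name, not something in the CMS, and it assumes the file contents have already been read into a string.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 byte order mark
// (the three bytes EF BB BF) if the string starts with one.
function strip_utf8_bom(string $contents): string
{
    if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
        return substr($contents, 3);
    }
    return $contents; // no BOM present, leave the bytes untouched
}

// Example: a .po header line with and without a BOM in front of it.
$withBom    = "\xEF\xBB\xBF" . 'msgid ""';
$withoutBom = 'msgid ""';

echo strip_utf8_bom($withBom), "\n";    // msgid ""
echo strip_utf8_bom($withoutBom), "\n"; // msgid ""
```

The function only looks at the first three bytes, so the rest of the file, including any multi-byte characters, passes through unchanged.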

With the Spanish, Swedish, etc. files, however, many of the translators have converted the text strings to HTML entities, e.g. "espa&ntilde;ol". In one way, this makes sense, since they are to be displayed on a web page. But is it the right thing to do? Or should such strings be in messages.po with all their accents, and converted with htmlspecialchars() before output?

The issue cropped up because I'm converting the site to XHTML 1.1 output, and that means encoding things like ampersands. I have functions for creating drop-down menus (e.g. "categories" and "languages"). If a menu has an item like a "Crime & Punishment" category, I'd want to convert it to "Crime &amp; Punishment" for XHTML compliance. But I don't want the language menu to RE-encode "espa&ntilde;ol" as "espa&amp;ntilde;ol", which would screw everything up.
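For illustration, here is a minimal sketch of one possible way around the double-encoding: decode whatever entities a string already contains back to plain utf-8, then escape exactly once. The name escape_label() is hypothetical, and the sketch assumes every label is meant as plain text rather than markup.

```php
<?php
// Hypothetical helper: escape a menu label for XHTML output exactly once,
// whether or not the translator already used HTML entities.
function escape_label(string $label): string
{
    // Turn existing entities (&ntilde;, &amp;, ...) back into utf-8 characters.
    $plain = html_entity_decode($label, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // Escape only the markup-significant characters: & < > " '
    return htmlspecialchars($plain, ENT_QUOTES, 'UTF-8');
}

echo escape_label('Crime & Punishment'), "\n"; // Crime &amp; Punishment
echo escape_label('espa&ntilde;ol'), "\n";     // español
echo escape_label('español'), "\n";            // español
```

The catch is that a label which is supposed to show a literal entity as text (say, a help string that displays "&amp;") would be decoded too, so this only fits strings where entities are never meant to be visible.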

So what's the best way to handle the relationship between HTML entities and gettext-based messages.po files?

In fact, the larger question is: do accented characters really need to be entity-ized on utf-8 pages, whose character set should be capable of displaying them directly? Obviously "htmlspecialchars()" handles characters that cause output problems (like < and >, which indicate tag opening/closing), but for a utf-8 based system, "n tilde" doesn't need to be encoded at all, does it?
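To make the difference concrete, here is a small sketch comparing the two standard PHP functions on a utf-8 string. The sample text is made up for the example, and it assumes the page itself is served with a utf-8 charset declaration.

```php
<?php
// Made-up sample containing an accent, markup characters, and Japanese text.
$text = 'español <b>&</b> 日本語';

// htmlspecialchars() touches only & < > " ' and leaves other utf-8 characters alone.
$minimal = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() also replaces every character that has a named entity.
$verbose = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $minimal, "\n"; // español &lt;b&gt;&amp;&lt;/b&gt; 日本語
echo $verbose, "\n"; // espa&ntilde;ol &lt;b&gt;&amp;&lt;/b&gt; 日本語
```

Both outputs render the same in a browser on a utf-8 page. The Japanese characters have no named entities, so htmlentities() leaves them as raw utf-8 anyway.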

It seems like early HTML education would state categorically that "español" needs to be written as "espa&ntilde;ol" on a web page, but that isn't really true for utf-8 pages, is it?

spud.

-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org            "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------

--
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


