On Mon, Mar 12, 2007 at 09:22:02AM +0200, Athan wrote:
> As already told before, there are many other issues with utf-8, most of them
> with recipes. Given the fact that most recipes use single byte regex,
> utf-8 issues are to be expected. However I know that you cannot do
> anything for that, except maybe consider using utf-8 as the single-one
> default encoding for pmwiki core.

The question of using utf-8 as the default encoding in the core comes
up fairly often (e.g., in PITS 00682), so let me answer it directly.

Yes, I agree that utf-8 is now the preferred encoding for web pages,
even for English and Western European languages. However, that wasn't
the case when PmWiki was created, which is partly why PmWiki has
traditionally defaulted to iso-8859-1.

But more to the point, there are a variety of PHP functions and
features that simply fail to work properly when utf-8 is the encoding
being used.

The biggest limitation with utf-8 is that regular expression patterns
can no longer use "[[:upper:]]" and "[[:lower:]]" to match non-ASCII
uppercase and lowercase characters in strings. This used to be a
serious limitation when many sites were running with WikiWords
enabled, because PmWiki could not detect WikiWords in the markup text
without these patterns. It's much less of an issue now that PmWiki
ships with WikiWords disabled by default... but it's still a bit of
an issue.

Many case-insensitive functions cease to be case-insensitive for
utf-8; in particular, the '/i' flag to preg_match and preg_replace
patterns doesn't seem to work for non-ASCII letters.

Another limitation is that some locales produce strings (e.g., the
date and time strings returned by PHP's strftime() function) that are
meant to be displayed using an iso-8859-1 character set, and thus
won't display properly if utf-8 is chosen.

Still another problem is dealing with non-ASCII characters in
filenames; switching to a utf-8 encoding means that any existing pages
or attachments with non-ASCII characters in their names will have to
be fixed in order to work properly. And, at least on my systems
(Linux), filenames with iso-8859-1 encodings display properly, while
utf-8 filenames appear garbled. (I fully admit that for many people
this situation is reversed, such that utf-8 appears correct while
iso-8859-1 appears garbled...
my point is simply that no matter what PmWiki does by default, it is
going to cause problems for some group of people.)

So, in order to default to utf-8, we have to provide workarounds for
the things that don't work in PHP, and every workaround has the
potential to really slow down page rendering and other features.
Rather than default to utf-8 and thus hit _every_ site with the
workaround performance penalty, even sites that don't need utf-8,
PmWiki defaults to iso-8859-1 (where PHP works most efficiently) and
lets those sites that need or want utf-8 encoding do a simple include
to get utf-8 to work.

This isn't to say that PmWiki will never switch to using a utf-8
encoding by default... I'm only saying that there are a few large
hurdles yet to be overcome before we can do that.

Pm
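
For anyone wondering why case-insensitive matching breaks, here is a rough sketch of the underlying problem. It is written in Python rather than PHP, it is only a model of a single-byte case table, and it makes no claim about how PHP or PCRE implement things internally. The helper names (single_byte_lower, byte_caseless_equal) and the sample word are made up for illustration.

```python
# Illustrative model only: Python, not PHP. The helper names are
# hypothetical and do not correspond to any PmWiki or PHP function.

UPPER = "\u00c9COLE"   # "ECOLE" with a capital E-acute as the first letter
LOWER = "\u00e9cole"   # the same word in lowercase


def single_byte_lower(data: bytes) -> bytes:
    """Lowercase bytes as if each byte were one iso-8859-1 character."""
    # latin-1 maps every byte to exactly one code point, and the lowercase
    # form of each of those code points is still inside latin-1, so this
    # round trip never fails. It is simply wrong for multi-byte utf-8 input.
    return data.decode("latin-1").lower().encode("latin-1")


def byte_caseless_equal(a: bytes, b: bytes) -> bool:
    """Compare two byte strings using the single-byte case table above."""
    return single_byte_lower(a) == single_byte_lower(b)


if __name__ == "__main__":
    for encoding in ("iso-8859-1", "utf-8"):
        same = byte_caseless_equal(UPPER.encode(encoding),
                                   LOWER.encode(encoding))
        print(encoding, "case-insensitive match:", same)

    # A character-aware comparison works regardless of the byte encoding.
    print("character-level match:", UPPER.lower() == LOWER)
```

In iso-8859-1 the accented letter is a single byte, so a single-byte case table folds it correctly and the two words match. In utf-8 the same letter is two bytes, the table folds the wrong byte, and the comparison fails. Getting the right answer means working on characters instead of bytes, which is the sort of extra work a utf-8 workaround has to do.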
