Vincent Massol wrote:
> On Feb 15, 2008, at 2:49 PM, Sergiu Dumitriu wrote:
>
>> Hi devs,
>>
>> We need to decide how to handle the charset/encoding in XWiki. We have
>> 3 options:
>>
>> 1. Leave it as it is. The default is ISO-8859-1, and the admin has to
>> make sure the JVM is started with the correct -Dfile.encoding param.
>> If another encoding is needed, it has to be changed in 4 places
>> (web.xml, xwiki.cfg, -Dfile.encoding, database charset+collation).
>>
>> 2. Force it to always be UTF-8, overriding the file.encoding setting.
>> This ensures internationalization, as UTF-8 works with any language.
>> And I think it is a safe move, as any modern system supports UTF-8
>> (given that XWiki requires Java 5, we can assume it will be running on
>> a modern system). This has the advantage that the code will be
>> simpler, as we don't have to check and switch encodings, but the
>> disadvantage that MySQL has to be manually configured for UTF-8, as by
>> default it comes with latin1.
>
> Isn't this a problem with databases which are configured in ISO-8859-1
> by default most of the time?
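For context on that MySQL default: the charset can be set per database (and per table or column), so the "manual configuration" amounts to something like the following sketch. The database name `xwiki` is a placeholder, not a prescribed value:

```sql
-- Create the wiki database with UTF-8 as the default charset/collation
-- ("xwiki" is a placeholder name; adjust to your setup).
CREATE DATABASE xwiki
    DEFAULT CHARACTER SET utf8
    DEFAULT COLLATE utf8_general_ci;

-- An existing database can be switched the same way:
ALTER DATABASE xwiki
    DEFAULT CHARACTER SET utf8
    DEFAULT COLLATE utf8_general_ci;
```

The JDBC side should match, e.g. by adding `useUnicode=true&characterEncoding=UTF-8` to the MySQL connection URL, so the driver does not fall back to its own default.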
Yes, it is. Right now there is a component somewhere that converts characters not supported by the encoding into &#xxx; escapes, but I can't remember which. With these escapes, the database always receives data in the encoding XWiki is configured with.

What I would really like is for Hibernate to be smart enough to enforce encodings, or to transparently convert data between the application and the database. Unfortunately, that's not the case. I'll have to check which encodings the various DBMSs use by default. I only know that MySQL comes with latin1, and I think HSQLDB and Derby come with UTF-8.

> Same question for the servlet container.

The servlet container does not (usually) have an encoding of its own. It works in the system encoding, which varies by OS and country. Windows systems are usually set to an encoding that reflects the language/country, and Linux systems mostly do the same, but tend to switch to UTF-8.

I checked what happens if I override the JVM encoding. It's not good, as the change applies to all applications in the JVM, and in a shared container that's really bad. Thus, I'm against overriding the JVM encoding. This makes option 2 impossible to implement, unless we decide to make XWiki products work only in certain environments.

It might be possible to do this in several years, once people have forgotten all about different charsets. Sometimes, decisions made in the early stages are hard to overcome and completely eliminate later. Still, we can't work with reduced charsets anymore. People all over the world should be able to use XWiki, and right now that is not possible. Unicode is the way to go, and UTF-8 seems the best choice. Even if we can't impose it on the environment, I still think it should be used both internally and externally. Internally means that whenever we have to convert from String to byte[] and back, we ask for the conversion to be made using UTF-8.
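As a sketch of what "internally" would mean in code: pin every String <-> byte[] conversion to UTF-8 instead of the platform default. This is Java 5 compatible (no StandardCharsets needed), and the helper class below is hypothetical, not an existing XWiki API:

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper illustrating explicit UTF-8 conversions; not an
// existing XWiki class.
public class Utf8Util {
    // String -> byte[] with a fixed encoding, independent of file.encoding.
    public static byte[] toBytes(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is guaranteed by the JVM spec, so this can't happen.
            throw new RuntimeException(e);
        }
    }

    // byte[] -> String, again pinned to UTF-8.
    public static String fromBytes(byte[] b) {
        try {
            return new String(b, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] b = toBytes("café");
        // "é" is two bytes in UTF-8 (0xC3 0xA9), so "café" is 5 bytes long,
        // whereas the platform default could give 4 bytes (latin1) or mangle
        // the character entirely.
        System.out.println(b.length);   // prints 5
        System.out.println(fromBytes(b)); // round-trips to "café"
    }
}
```

With the no-argument `getBytes()` / `new String(byte[])` forms, the result silently depends on `-Dfile.encoding`, which is exactly the fragility described above.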
Externally means that the container is just a middleman transparently passing data to and from the client, and the client already works with UTF-8. The web, being born a bit later, understood that Unicode is the right answer when deciding which characters to support, so most web technologies are made to work with Unicode, and with its UTF-8 representation in particular. We already have problems with URLs, GET parameters and AJAX calls because we're not working with UTF-8.

The tough part is that there are some tools that handle conversions internally, and they work with the JVM encoding. We have such problems with JRCS (rollbacks replace non-ASCII chars with question marks), and with FOP (the same question marks appear). I'll have to study what can be done to overcome these problems.

I know that this is an important decision, as it affects large portions of the code. But it is a decision that must be made sooner rather than later, so that we can prepare for the switch (btw, if we vote for anything other than 1, then this will be part of a future M1, and not of the 1.3 RCs).

So, here's option number 4:

Leave the system as it is, since it must be shared with other applications, but work with UTF-8 both internally, by asking for any String <-> byte[] conversion to be made in that encoding, and externally, by sending responses and expecting requests in UTF-8. Given that the database accepts charset configuration at any level (database, table, column), it is OK to ask admins to configure the XWiki database for a certain encoding.

Even better, I think this could be done in an aspect, so that we don't have to manually try-catch all the conversions and always remember to specify the encoding by hand. I'm not an AOP expert, so I'm not sure this is possible. Is it?

> I can't vote till I know the answer to these 2 questions.
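The URL/GET-parameter problem mentioned above has the same root cause: the one-argument `URLEncoder.encode(String)` uses the platform encoding, while the two-argument form (available since Java 1.4) lets us pin UTF-8. A minimal sketch, not XWiki code:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Minimal sketch: percent-encoding a GET parameter with an explicit
// encoding instead of the JVM default.
public class UrlEncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String value = "café";

        // With UTF-8 the non-ASCII character becomes two escaped bytes...
        String utf8 = URLEncoder.encode(value, "UTF-8");        // caf%C3%A9
        // ...while with ISO-8859-1 it becomes a single one.
        String latin1 = URLEncoder.encode(value, "ISO-8859-1"); // caf%E9

        System.out.println(utf8);
        System.out.println(latin1);

        // A UTF-8 client decoding a latin1-encoded URL (or vice versa) gets
        // garbage, which is exactly the AJAX/GET mismatch described above.
        System.out.println(URLDecoder.decode(utf8, "UTF-8"));   // café
    }
}
```

Browsers sending AJAX requests already percent-encode in UTF-8, so decoding on the server with anything else produces the mismatches we are seeing.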
>
> Thanks
> -Vincent
>
> PS: As a principle I don't like hard-coding anything, so if these
> questions are answered satisfactorily I'll be ok, but with a single
> config parameter set to UTF8 by default in xwiki.cfg.
>
>> 3. Keep it configurable, but by only specifying it in one place
>> (xwiki.cfg or web.xml), and enforcing that encoding in the JVM (by
>> overriding file.encoding). The default should be UTF-8.
>>
>> Here's my +1 for option 2, -1 for option 1, and 0 for option 3.

-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs

