Re: Character encoding for APT files

Hervé BOUTEMY Fri, 23 Jan 2009 12:25:19 -0800

I knew this would cause another discussion: encoding choices are always like 
this :)


On Friday 23 January 2009, Trevor Harmon wrote:
> On Jan 22, 2009, at 4:50 PM, Hervé BOUTEMY wrote:
> > Sorry, I was working on other things and missed this discussion.
> > I just commented (and closed as "Not A Bug" :) ) the issue.
>
> I agree that autodetecting is not a bullet-proof feature, but an
> absolute guarantee is not required in this case. I share Jason van
> Zyl's view: "If it's right most of the time, and it saves the user
> from having to know or worry about it then yes I would use it." [1]
The problem with such auto-detection in a tool like Doxia, as used by
maven-site-plugin, is that if the guessed encoding is wrong, you can't do
anything about it (or you have to configure the encoding, which is what you
wanted to avoid). This is not the case in a GUI such as a web browser, where
the user can change the encoding in a couple of clicks if there is a problem.
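
To illustrate why a guess can be silently wrong, here is a minimal Java sketch (not Doxia code; the class and method names are made up for this example). The same two bytes decode without any error under both UTF-8 and ISO-8859-1, so a detector has nothing but heuristics to choose between them:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Doxia or maven-site-plugin.
class EncodingAmbiguity {

    /** Decodes the same raw bytes with whichever charset the caller assumes. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** Formats each char as U+XXXX so the output does not depend on the console encoding. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of "e with acute accent" (U+00E9) is the bytes 0xC3 0xA9.
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        // Read as UTF-8: one character.
        System.out.println(codePoints(decode(bytes, StandardCharsets.UTF_8)));      // U+00E9

        // Read as ISO-8859-1: two characters, no error reported.
        System.out.println(codePoints(decode(bytes, StandardCharsets.ISO_8859_1))); // U+00C3 U+00A9
    }
}
```

Both results are valid text as far as the decoder is concerned; only someone who knows what the file was supposed to contain can tell which one is right.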

>
> Another issue is that without autodetection, supporting more than one
> type of character encoding for the APT files in a Maven project is
> impossible.
Same remark as before: what if the encoding guessed for a file is wrong?

>
> That said, if autodetection is simply out of the question, let me
> suggest a different tack. Doxia appears to require ISO-8859-1 for APT
> files by default. This is a Western-centric encoding that lacks
> support for Asian languages. It is also deprecated. According to
> Wikipedia:
>
> "The ISO/IEC working group responsible for maintaining eight-bit coded
> character sets disbanded and ceased all maintenance of ISO 8859,
> including ISO 8859-1, in order to concentrate on the Universal
> Character Set and Unicode." [2]
>
> I would also say that with the increasing popularity of UTF-8, the
> number of encoding problems encountered by users due to Doxia favoring
> ISO-8859-1 is already larger than any problems that might occur due to
> bad autodetection. In other words, autodetection might be wrong some
> of the time, but for many users, ISO-8859-1 is wrong all of the time.
Yes, I understand this one: the historic default encoding is ISO-8859-1, which
is problematic for a lot of people.
A proposal to make the encoding easily configurable has been implemented in a
lot of Maven plugins: see [4].
When the question of the default encoding came up, there was a large poll
(you'll find links in the proposal), which concluded that the default source
encoding should be the platform encoding.

The configuration part of the proposal was implemented in maven-site-plugin
2.0-beta-7 on 03 Jul 2008 (see MSITE-314), but the default encoding wasn't
changed: MSITE-326 tracks this, so people can vote if they want the platform
encoding (= the full proposal, which is platform dependent) instead of
ISO-8859-1. There doesn't seem to be much traction...

There are a lot of Maven plugins today that complain if you don't configure
the default encoding: it is a simple property to add to your POM. Doesn't that
meet your needs?
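
The rule from the full proposal (use the configured encoding, i.e. the `project.build.sourceEncoding` property described in [4], otherwise fall back to the platform encoding) can be sketched in Java roughly as follows. This is only an illustration under my own assumptions: `resolveEncoding` is a hypothetical helper, not actual Doxia or maven-site-plugin API.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of Doxia or maven-site-plugin.
class SourceEncoding {

    /**
     * Returns the explicitly configured charset, or the platform default
     * when nothing is configured (null or blank value).
     */
    static Charset resolveEncoding(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            // Platform dependent: the result may differ from one machine to another.
            return Charset.defaultCharset();
        }
        // Throws an unchecked exception if the name is illegal or unsupported.
        return Charset.forName(configured.trim());
    }

    public static void main(String[] args) {
        // Explicit configuration: same result on every machine.
        System.out.println(resolveEncoding("UTF-8"));

        // No configuration: whatever the platform happens to use.
        System.out.println(resolveEncoding(null));
    }
}
```

The first call is reproducible everywhere; the second is exactly the platform-dependent behaviour discussed above, which is why plugins warn when the property is missing.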

>
> In light of this, I suggest changing Doxia's APT handling so that it
> defaults to UTF-8 rather than ISO-8859-1. Not only will this help 
> UTF-8 users (who may be a majority),
Do you have figures, or is it a guess? AFAIK, the Windows default encoding is
still CP-1252 for West European languages. I don't know if this has changed
with Vista.
So I doubt everybody has switched to UTF-8.
There is no truly ideal default encoding: only configuration fixes the issue.

> it will also help increase 
> Maven's acceptance in the Asian world, a trend that is already
> happening [3].
>
> I can work on a patch for this, if there's a chance it will be accepted.
>
> Trevor
>
> [1] http://www.nabble.com/Re%3A--VOTE--POM-Element-for-Source-File-Encoding-p16566779.html
> [2] http://en.wikipedia.org/wiki/ISO_8859-1
> [3] http://blogs.sonatype.com/people/2008/07/apache-maven-the-definitive-chinese-guide/
[4] http://docs.codehaus.org/display/MAVENUSER/POM+Element+for+Source+File+Encoding
