Re: [VOTE] POM Element for Source File Encoding


Benjamin Bentmann wrote:
You could of course write an encoding detection plugin which could 
examine the code and set the required property accordingly.


Personally, I don't see the use case for that. If there are really users
out there that don't know what file encoding they are using when writing up
their sources, they are most probably happy with the proposed default value
of Latin-1. Alternatively, this encoding detection plugin could be as
simple as printing out the Java system property ${file.encoding} which obviously
worked well enough for the user.


${file.encoding} will only work if the file originated on the same machine.

I am thinking of semi-automatic conversions of inhomogeneous code into Maven. 
E.g. some teacher collects homework from his students as a bunch of zip 
files containing only source, has a script to turn each into a Maven 
project, and a master project interacting with them, like letting them 
compete with one another or whatever. In this case one might wish to 
automatically detect the encoding of every module, especially in locales 
with several commonly used encodings, so that string literals in these 
classes are handled correctly without the students even knowing what an 
encoding is.


But that's a corner case, so I guess we should stop discussion about the 
use of such a program here, until someone actually requires it.


Greetings,
 Martin




Re: [VOTE] POM Element for Source File Encoding


Paul Benedict wrote:

Just a proposal: Maven could loosen its parsing rules when it detects
versions greater than it is configured to accept.

Forward compatibility would be nice.


For anyone seriously interested in interoperability, I suggest a look 
at http://www.w3.org/2005/05/xsd-versioning-resources.html, especially 
the use cases, which illustrate several issues quite well.


 Martin




Re: [VOTE] POM Element for Source File Encoding


Benjamin Bentmann wrote:
With regard to user errors, my general 
suggestion is to fail the build. This unforgiving attitude should not be 
that unfamiliar to users: It has been chosen for a popular format like 
XML which is also employed by Maven for a few files.


The problems depend on the encodings: If one feeds Latin-1 into a UTF-8 
decoder, one will most likely encounter invalid byte sequences, making the 
decoder fail. That's my favorite case as it clearly shows the user 
something is wrong and needs his attention. The other case is worse 
because it is more subtle: Feeding UTF-8 into a Latin-1 decoder will pass but 
produce output that only a human can identify as garbage by closely 
analyzing the few non-ASCII characters.
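
The two cases can be reproduced with a few lines of Java. This is only a
sketch: the class and method names are made up for illustration, and it
uses java.nio.charset.StandardCharsets, which assumes Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Maven or plexus-utils.
final class EncodingMismatchDemo {

    /** Returns true if the bytes form a valid sequence in the given charset. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Encodes the text as UTF-8 and decodes it as Latin-1; this never fails. */
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "M\u00fcller"; // "Mueller" written with u-umlaut
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);

        // Latin-1 bytes fed to a strict UTF-8 decoder: rejected.
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));

        // UTF-8 bytes fed to a Latin-1 decoder: accepted, but two wrong
        // characters appear in place of the umlaut.
        System.out.println(utf8ReadAsLatin1(name));
    }
}
```

Note that the failure in the first case only shows up because the decoder
is configured to report errors; a decoder left at its defaults may behave
differently.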


Taking this together, one might argue to have UTF-8 the default, not 
ISO-8859-1.


Almost anything that passes UTF-8 encoding constraints will be indeed 
UTF-8, as non-ASCII files that are not UTF-8 will almost certainly 
contain sequences not valid in UTF-8. So if a user fails to specify the 
encoding he uses, and if this encoding isn't UTF-8, then things will 
break for him. This has two advantages:


1. fail-fast behaviour. If there is a misconfiguration, the Maven run 
will die, and the developer can fix the issue. You don't have to wait 
for some other developer complaining about garbled strings or a user 
complaining about a broken website until you can fix things.


2. promote Unicode. While there are a lot of encodings out there for 
historic reasons, most of them suffer severe drawbacks in an 
international software project, because they either can't express all 
needed characters, or they are not common outside a small region. So 
while Taiwanese developers might be happy to develop an English/Chinese 
project in Big5, prospective American contributors might not get their 
editor to load files as Big5. UTF-8, on the other hand, is used 
worldwide and provides the whole Unicode range.
For new projects, I guess UTF-8 would be a reasonable best practice, and 
making this best practice the default in Maven might promote it.


Of course this conflicts with previous discussions about Latin1 ensuring 
that any project can get compiled, as it has no invalid byte sequences. 
The choice is whether, in the absence of configuration,


A) you want your compile to succeed all the time, possibly generating 
the wrong results, or


B) you want your build to fail in case of a misconfiguration (including 
missing configuration), but ensure correct results if it does not fail.


If I understood him correctly, Jason voted for A). I took his request 
for non-dying builds as a requirement and pointed out that this is 
possible with Latin1. Now that I think about it, I believe I would 
rather want B), as I'm all for failfast deterministic behaviour.


It should be checked whether plugins really die for invalid UTF-8 
sequences, and what the output looks like. If possible, plugins should 
point out that a misconfiguration of the encoding in the pom (either the 
plugin configuration or the proposed global configuration property) is 
possibly the cause of the error, if it's not a developer using another 
encoding.


Note that ASCII-only sources will compile cleanly no matter the default 
encoding, so all projects that don't need to worry about encoding won't 
be forced to do so. Only international projects where encoding is 
relevant will force their developers to either follow best practices or 
explicitly state their policy.


Greetings,
 Martin




Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Benjamin Bentmann


Taking this together, one might argue to have UTF-8 the default, not
ISO-8859-1.


In general, I completely agree with your preference for Unicode and fail-fast
behavior. If I had been involved when the Maven story started, I would have
proposed UTF-8 as the default value, no doubt.

As for today, I tried to consider consistency with existing behavior. The
Maven Site Plugin was already using Latin-1 as the default value for
inputEncoding and outputEncoding and so I proposed this for other plugins,
too. Indeed, one of the patches (MJAVADOC-165) was just released such that
already two plugins teach users this default value. Therefore I fear it
might be too late to introduce another default value. If the community
believes this change is worth the confusion caused for users, I'm the first
one running the other way round ;-)


It should be checked whether plugins really die for invalid UTF-8
sequences, and what the output looks like.


That's a good point. It appears we need to do some extra homework here: The
simplistic use of InputStreamReader and OutputStreamWriter will silently
convert unmappable byte sequences to a default character ('?', see also
[0]). I guess we could nicely hide the required implementation by means of
the existing methods in Reader-/WriterFactory from plexus-utils.
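
A minimal sketch of such a strict reader, using only JDK classes. The
names StrictReaderSketch, newStrictReader and readStrictOrNull are
hypothetical; this is not the plexus-utils implementation, just one way
it could be done. As far as I know, a default InputStreamReader
substitutes U+FFFD for malformed input, while the '?' shows up on the
encoding side.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper; not the actual ReaderFactory from plexus-utils.
final class StrictReaderSketch {

    /** Creates a reader that throws instead of silently replacing bad input. */
    static Reader newStrictReader(InputStream in, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new InputStreamReader(in, decoder);
    }

    /** Returns the decoded text, or null if the bytes are invalid in the charset. */
    static String readStrictOrNull(byte[] bytes, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = newStrictReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
            return text.toString();
        } catch (IOException e) {
            // CharacterCodingException is an IOException, so decoding
            // errors end up here.
            return null;
        }
    }
}
```

A plugin would presumably not swallow the exception like this but turn
it into a build failure that mentions the configured encoding.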


Note that ASCII-only sources will compile cleanly no matter the default
encoding


Most of the time, but UTF-16 and EBCDIC do not even have ASCII in common.
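
To illustrate, a small sketch that compares how different charsets encode
an ASCII-only string. The class and method names are invented for this
example. EBCDIC is left out because charsets such as IBM037 are optional
and may not be installed in every JRE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
final class AsciiCompatibilityDemo {

    /** True if the charset encodes ASCII-only text to the same bytes as US-ASCII. */
    static boolean encodesLikeAscii(String asciiText, Charset charset) {
        byte[] asAscii = asciiText.getBytes(StandardCharsets.US_ASCII);
        byte[] asOther = asciiText.getBytes(charset);
        return Arrays.equals(asAscii, asOther);
    }

    public static void main(String[] args) {
        String source = "class A {}";
        // UTF-8 and Latin-1 are supersets of ASCII at the byte level.
        System.out.println(encodesLikeAscii(source, StandardCharsets.UTF_8));
        System.out.println(encodesLikeAscii(source, StandardCharsets.ISO_8859_1));
        // UTF-16 uses two bytes per character, so the bytes differ.
        System.out.println(encodesLikeAscii(source, StandardCharsets.UTF_16BE));
    }
}
```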


Benjamin


[0] http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Brian E. Fox


As for today, I tried to consider consistency with existing behavior. The
Maven Site Plugin was already using Latin-1 as the default value for
inputEncoding and outputEncoding and so I proposed this for other plugins,
too. Indeed, one of the patches (MJAVADOC-165) was just released such that
already two plugins teach users this default value. Therefore I fear it
might be too late to introduce another default value. If the community
believes this change is worth the confusion caused on users, I'm the first
one running the other way round ;-)

Don't break existing builds. No regressions. ;-)




Re: [VOTE] POM Element for Source File Encoding


Benjamin Bentmann wrote:
In general, I completely agree with your preference to Unicode and fail-fast
behavior. If I had been involved when the Maven story started, I would have
proposed UTF-8 as the default value, no doubt.

As for today, I tried to consider consistency with existing behavior. The
Maven Site Plugin was already using Latin-1 as the default value for
inputEncoding and outputEncoding and so I proposed this for other plugins,
too. Indeed, one of the patches (MJAVADOC-165) was just released such that
already two plugins teach users this default value. Therefore I fear it
might be too late to introduce another default value. If the community
believes this change is worth the confusion caused on users, I'm the first
one running the other way round ;-)


I see your point. Worth another vote? Or should this switch be postponed 
to 2.1, trading consistency in minor version upgrades for a longer time 
for these Latin1 defaults to be established?


Given the failfast nature of the UTF-8 default, we won't have to worry 
about the switch going unnoticed. Developers switching from a version 
defaulting to Latin1 to UTF-8 will notice the change immediately, and 
for development in a heterogeneous environment they can simply override 
the super-POM with their own default.


So while I agree that a change in default either now or in the future is 
ugly, it is not taboo, and I believe worth the gain.



That's a good point. It appears we need to do some extra homework here: The
simplistic use of InputStreamReader and OutputStreamWriter will silently
convert unmappable byte sequences to a default character ('?', see also
[0]). I guess we could nicely hide the required implementation by means of
the existing methods in Reader-/WriterFactory from plexus-utils.


That works for plugins doing the conversion in code under our control. 
Other plugins that use external libraries or tools might be more difficult.



Note that ASCII-only sources will compile cleanly no matter the default
encoding


Most of the time, but UTF-16 and EBCDIC do not even have ASCII in common.


I was thinking about the default of the default, i.e. the value to be 
set in the super-POM. We certainly won't choose UTF-16 or EBCDIC for 
this global default, and as files encoded in UTF-16 or EBCDIC don't 
count as ASCII-only, my


 Martin




Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Jason van Zyl

All sounds fine. Just wanted you to keep the bigger picture in
mind.

Please do the work on a branch and go through the rigor of Brian's
example and make sure it works before you merge it into something we
would release to users. This is not an insignificant change.

On 9-Apr-08, at 10:36 AM, Benjamin Bentmann wrote:
Make sure you consider the case where you have people developing
the same code base all over the world, and the possible reasoning
of falling back to platform default encoding. Consider the team
spread across the US, Russia, and China and what do they do
normally?

This international spread of developers is in particular the case we
have in mind. I mean, how should such a team (say the Maven
community) deliver reliable build output if not all developers have
agreed to use the same file encoding for the sources? Say the US
devs would have ASCII as default encoding, the Europeans Latin-1 and
the Asians Big5 for our nice potpourri. Even if all have agreed to
use English for coding, you still might encounter Non-ASCII
characters that get messed up, e.g. in javadoc comments that carry
the name of the contributor/committer. Other developers might
experience build failures because of encoding mismatch, at best
other people's names are disfigured which is rather impolite.

The Eclipse folks had a similar problem [0]. The solution: Lock the
encoding down for the entire project.

Is it possible to specify an encoding in one place that doesn't
work somewhere else?

Yes, in theory you can have one user specify an encoding that
another user's JVM does not support. As the class javadoc about
Charset [1] states, only a few encodings - including Latin-1 and
UTF-8 - are required to be supported, although the reference
implementation from Sun supports quite a few more encodings [2].
However, I don't consider this a practical concern. Given that
support for UTF-8 is mandatory, there exists an encoding that can
handle nearly any character people would like to enter and Java can
handle. Hence there exists a solution that works for everyone on the
team.

I am fortunate in that I've never seen an encoding problem in Maven
personally. In your proposal you talk about aligning the encoding
value, but my question is: in what cases have you found the default
encoding not working? You don't talk about that at all in the
proposal.

Well, choose your favorite from a search for encoding on all Maven
2 projects in JIRA ;-)

- http://jira.codehaus.org/browse/MNG-2932
- http://jira.codehaus.org/browse/MANTTASKS-14
- http://jira.codehaus.org/browse/MTAGLIST-27
- http://jira.codehaus.org/browse/MRELEASE-302
- http://jira.codehaus.org/browse/DOXIA-103
- http://jira.codehaus.org/browse/MCHANGES-71
- (about 300 more hits)

ASCII is quite safe, but anything which requires more than those 7
bits just needs special care.

Do you know what happens with all the tools that people use? Like
checking into all SCMs, and what happens when people check out on
to their system, editors, IDEs. I'm merely suggesting that there
might be a reason most things fall back to the default encoding on
the system: it's generally been a hard thing to corral.

In principle you're right, most of the tools are intended for usage
with the platform's encoding. This seems to include the popular
diff/patch tools used by many SCMs, which do not really support
different encodings [3] (yet another historic design flaw, next to
the two-digit year format).

Also, the SCMs themselves do not seem to care about (file content)
encoding yet; I have found proposals for Subversion [5] and Bazaar
[4] but that's it. However, as far as I can tell, SCMs, not knowing
about file encoding, also do not perform any conversions on the file
content but simply assume a simple byte-to-char mapping like ASCII
when doing EOL normalization or keyword substitution.

As for editors and IDEs: Even this tiny thing Notepad from Windows
supports UTF-8 nowadays and I wouldn't call that an editor. Does
anybody know about a popular editor/IDE that calls itself mature but
does not allow configuring the file encoding?

Benjamin

[0] https://bugs.eclipse.org/bugs/show_bug.cgi?id=132898
[1] http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html

[2] http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html
[3]
http://www.gnu.org/software/diffutils/manual/html_mono/diff.html#Internationalization
[4]
http://bazaar-vcs.org/UnicodeSupport?action=showredirect=EncodingSupport#head-43c0111da063796da433179faaf8d065bda5c42e
[5] http://svn.haxx.se/dev/archive-2006-03/1182.shtml


Thanks,

Jason

--
Jason van Zyl
Founder, Apache Maven
jason at sonatype

Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Benjamin Bentmann

Make sure you consider the case where you have people developing the same
code base all over the world, and the possible reasoning of falling back
to platform default encoding. Consider the team spread across the US,
Russia, and China and what do they do normally?

This international spread of developers is in particular the case we have in
mind. I mean, how should such a team (say the Maven community) deliver
reliable build output if not all developers have agreed to use the same file
encoding for the sources? Say the US devs would have ASCII as default
encoding, the Europeans Latin-1 and the Asians Big5 for our nice potpourri.
Even if all have agreed to use English for coding, you still might encounter
Non-ASCII characters that get messed up, e.g. in javadoc comments that carry
the name of the contributor/committer. Other developers might experience
build failures because of encoding mismatch, at best other people's names
are disfigured which is rather impolite.

The Eclipse folks had a similar problem [0]. The solution: Lock the encoding
down for the entire project.

Is it possible to specify an encoding in one place that doesn't work
somewhere else?

Yes, in theory you can have one user specify an encoding that another user's
JVM does not support. As the class javadoc about Charset [1] states, only a
few encodings - including Latin-1 and UTF-8 - are required to be supported,
although the reference implementation from Sun supports quite a few more
encodings [2]. However, I don't consider this a practical concern. Given that
support for UTF-8 is mandatory, there exists an encoding that can handle
nearly any character people would like to enter and Java can handle. Hence
there exists a solution that works for everyone on the team.
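
Whether a given JVM supports an encoding can be checked up front with
Charset.isSupported. The following is just a sketch with an invented
class name; the list in main is an arbitrary sample, and the result for
Big5 depends on the JRE in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper class for illustration only.
final class CharsetSupportCheck {

    /** True if this JVM can decode and encode the named charset. */
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name is not even syntactically valid.
            return false;
        }
    }

    public static void main(String[] args) {
        // UTF-8 and ISO-8859-1 are among the charsets every Java
        // platform must support; Big5 is optional.
        for (String name : new String[] {"UTF-8", "ISO-8859-1", "Big5"}) {
            System.out.println(name + " -> " + isUsable(name));
        }
    }
}
```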

I am fortunate in that I've never seen an encoding problem in Maven
personally. In your proposal you talk about aligning the encoding value,
but my question is: in what cases have you found the default encoding not
working? You don't talk about that at all in the proposal.

Well, choose your favorite from a search for encoding on all Maven 2
projects in JIRA ;-)

ASCII is quite safe, but anything which requires more than those 7 bits just
needs special care.

Do you know what happens with all the tools that people use? Like
checking into all SCMs, and what happens when people check out on to their
system, editors, IDEs. I'm merely suggesting that there might be a reason
most things fall back to the default encoding on the system: it's
generally been a hard thing to corral.

In principle you're right, most of the tools are intended for usage with the
platform's encoding. This seems to include the popular diff/patch tools used
by many SCMs, which do not really support different encodings [3] (yet
another historic design flaw, next to the two-digit year format).

Also, the SCMs themselves do not seem to care about (file content) encoding
yet; I have found proposals for Subversion [5] and Bazaar [4] but that's it.
However, as far as I can tell, SCMs, not knowing about file encoding, also do
not perform any conversions on the file content but simply assume a simple
byte-to-char mapping like ASCII when doing EOL normalization or keyword
substitution.

As for editors and IDEs: Even this tiny thing Notepad from Windows
supports UTF-8 nowadays and I wouldn't call that an editor. Does anybody
know about a popular editor/IDE that calls itself mature but does not allow
configuring the file encoding?

Benjamin

[0] https://bugs.eclipse.org/bugs/show_bug.cgi?id=132898
[1] http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html
[2] http://java.sun.com/j2se/1.4.2/docs/guide/intl/encoding.doc.html
[3]
http://www.gnu.org/software/diffutils/manual/html_mono/diff.html#Internationalization
[4]
http://bazaar-vcs.org/UnicodeSupport?action=showredirect=EncodingSupport#head-43c0111da063796da433179faaf8d065bda5c42e
[5] http://svn.haxx.se/dev/archive-2006-03/1182.shtml


Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Benjamin Bentmann


I see your point. Worth another vote? Or should this switch be postponed
to 2.1, trading consistency in minor version upgrades for a longer time
for these Latin1 defaults to be established?
[...]
So while I agree that a change in default either now or in the future is
ugly, it is not taboo, and I believe worth the gain.


Latin-1 being the default value was part of our proposal and not many people
complained about that nor changed their previous votes. So I believe another
vote won't deliver a different outcome.

Besides, Brian's honorable efforts to ban regressions are a good argument to
keep the already started route with Latin-1. It might not be the best
default value, but it's only a one liner to change it.


Benjamin



Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Milos Kleint

On Wed, Apr 9, 2008 at 7:36 PM, Benjamin Bentmann wrote:

 Make sure you consider the case where you have people developing the same
 code base all over the world, and the possible reasoning of falling back to
 platform default encoding. Consider the team spread across the US, Russia,
 and China and what do they do normally?
 

  This international spread of developers is in particular the case we have
 in mind. I mean, how should such a team (say the Maven community) deliver
 reliable build output if not all developers have agreed to use the same file
 encoding for the sources? Say the US devs would have ASCII as default
 encoding, the Europeans Latin-1 and the Asians Big5 for our nice potpourri.
 Even if all have agreed to use English for coding, you still might encounter
 Non-ASCII characters that get messed up, e.g. in javadoc comments that carry
 the name of the contributor/committer. Other developers might experience
 build failures because of encoding mismatch, at best other people's names
 are disfigured which is rather impolite.

  The Eclipse folks had a similar problem [0]. The solution: Lock the
 encoding down for the entire project.

Just for the record, netbeans.org projects all use UTF-8. We have devs
in the US, the Czech Republic, Russia and elsewhere. NetBeans allows
setting a default encoding per project; for Maven projects I currently
look up how maven-compiler-plugin is configured. If no configuration is
in place, I fall back to the platform encoding.

Encoding is not only different across countries but also across
platforms. While most Linux distributions use UTF-8, on Windows you get
a different encoding depending on which localized version you buy, I
think. The Eastern European set is different from the Western European
one. My Mac falls back to something called MacRoman as its default
encoding.

Milos







Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Hervé BOUTEMY

On Wednesday, 9 April 2008, Benjamin Bentmann wrote:
  I see your point. Worth another vote? Or should this switch be postponed
  to 2.1, trading consistency in minor version upgrades for a longer time
  for these Latin1 defaults to be established?
  [...]
  So while I agree that a change in default either now or in the future is
  ugly, it is not taboo, and I believe worth the gain.

 Latin-1 being the default value was part of our proposal and not many
 people complained about that nor changed their previous votes. So I believe
 another vote won't deliver a different outcome.

 Besides, Brian's honorable efforts to ban regressions are a good argument
 to keep the already started route with Latin-1. It might not be the best
 default value, but it's only a one liner to change it.

I have one argument in favor of ISO-8859-1 as default: it's the default 
encoding of properties files, as defined by the JDK's java.util.Properties class.
Once Maven requires JDK 1.5+, we'll be able to switch to XML properties files, 
and then UTF-8 as default will be no problem...
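
A short sketch of that behaviour: Properties.load(InputStream) decodes
the bytes as ISO-8859-1, whatever encoding the file was actually written
in. The class name, method name and the "author" key are made up for
this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class for illustration only.
final class PropertiesEncodingDemo {

    /**
     * Writes "author=<value>" in the given file encoding, loads it with
     * Properties.load(InputStream) and returns the value that was read.
     */
    static String roundTrip(String value, Charset fileEncoding) {
        byte[] fileBytes = ("author=" + value).getBytes(fileEncoding);
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("author");
    }

    public static void main(String[] args) {
        String name = "Herv\u00e9";
        // Written as Latin-1: read back unchanged.
        System.out.println(name.equals(roundTrip(name, StandardCharsets.ISO_8859_1)));
        // Written as UTF-8: loads without error, but the value is garbled.
        System.out.println(name.equals(roundTrip(name, StandardCharsets.UTF_8)));
    }
}
```

The XML form (Properties.loadFromXML, available since JDK 1.5) reads the
encoding from the XML declaration instead, which is what makes UTF-8
practical there.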




Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Hervé BOUTEMY

On Wednesday, 9 April 2008, Jason van Zyl wrote:
 All sounds fine. Just wanted you to think about the bigger picture in
 mind.

 Please do the work on a branch and go through the rigor of Brian's
 example and make sure it works before you merge it into something we
 would release to users. This is not an insignificant change.

I created http://svn.apache.org/viewvc/maven/sandbox/branches/MNG-2216/ 
with branches of the javadoc and jxr plugins to test the change, and a 
sample use case.

Isn't that sufficient?

Hervé



Re: [VOTE] POM Element for Source File Encoding


Paul Benedict wrote:

My only concern is that the encoding kind of assumes one kind of source
file.


We are well aware that different kinds of text files may use different
encodings. A simple example is using UTF-8 for Java source files and Latin-1
for properties files.

However, the primary goal of the proposal is to replace the default encoding
defined by the JVM (platform-dependent) with a value defined by the POM
(platform-independent).

Hence, we started off with a single default value. The emphasis lies on
*default*, i.e. the proposed POM property/element is not intended as the
final means to configure the employed file encoding throughout the entire
project. It is just a value plugins can use to initialize their
configuration in case the user did not explicitly specify an encoding.


I am never in a position to have multiple encodings on my projects


And I would argue that quite a few people follow the same approach.
Otherwise I can hardly understand why users have not already complained
about those plugins that do not provide an encoding parameter at all yet.
Besides, not every IDE allows users to configure different file encodings
in a single project, so this really seems to be the major use case.


but I suppose if you're compiling many different types of sources, people
would want to tie the source to the extension type.


A file extension is just one method to distinguish files, another one is
context of use. I believe that having the possibility to configure file
encoding on a per plugin basis is good enough to capture different types of
files.

If the need to set up encodings per file extension really arises someday, we
can think more closely about that. But even then, I wouldn't like to write
something like this in my POM to lock down the encoding for every file
extension that might hang around in the project:

 <fileEncodings>
   <fileEncoding>
     <extensions>txt,java,groovy,aj,bsh,apt,...</extensions>
     <name>UTF-8</name>
   </fileEncoding>
 </fileEncodings>
I would want to have a single default value to catch the major case and this 
default value should in no case depend on my JVM. So I'm back on 
${project.build.sourceEncoding}.



Benjamin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: [VOTE] POM Element for Source File Encoding


Jason van Zyl wrote:

Would being able to detect the encoding help with making this less
complicated. Something JChardet?


I'm not really sure what you meant to say. JChardet is a library that
performs a best *guess* on file encoding by peeking at a byte stream. We
don't want to base our builds on heuristics, do we?



Benjamin 




Re: [VOTE] POM Element for Source File Encoding


Hervé Boutemy wrote:

this one is more tricky, even if the change in pom.xml is a simple
addition of
an element... Don't really know how to handle this without breaking things
for Maven 2.0 when an artifact with this addition is deployed to a
repository.


Handling POM additions is a more general concern and not really the point of
our proposal. For Maven 2.0.x, adding a normal property
 <properties>
   <project.build.sourceEncoding>...</project.build.sourceEncoding>
 </properties>
to the super POM won't hurt the model validation for 4.0.0. For now, the
simple question to answer is: will the element be named as proposed? Once
we get consensus about this name, we can continue to patch the plugins to
use this property for the parameters, knowing that it will be
forward-compatible with Maven 2.1.

For Maven 2.1, a new model version will be introduced. Users that choose to
employ this version will always experience build failures with Maven 2.0.x
due to the failed model validation. Again, this is nothing specific to our
proposal about sourceEncoding. We just added another element to the list of
required POM additions:
- custom profile activators
- site directory
- plugin management for reporting
- ...


The only risk is that the property chosen,
${project.build.sourceEncoding},
makes users think of a new element <project><build><sourceEncoding> in the
pom


Yes, we will have to properly document this just like for the new import
scope.


Benjamin



Re: [VOTE] POM Element for Source File Encoding



On 8-Apr-08, at 1:09 AM, Benjamin Bentmann wrote:

Jason van Zyl wrote:

Would being able to detect the encoding help with making this less
complicated. Something JChardet?


I'm not really sure what you meant to say. JChardet is a library
that performs a best *guess* on file encoding by peeking at a byte
stream. We don't want to base our builds on heuristics, do we?




If it's right most of the time, and it saves the user from having to  
know or worry about it then yes I would use it.




Benjamin




Thanks,

Jason

--
Jason van Zyl
Founder,  Apache Maven
jason at sonatype dot com
--

We all have problems. How we deal with them is a measure of our worth.

-- Unknown 






Re: [VOTE] POM Element for Source File Encoding


Jason van Zyl wrote:

If it's right most of the time, and it saves the user from having to  know
or worry about it then yes I would use it.


Could you elaborate on this a little more? Say we start easy and have a build
with just about 100 Java source files. Do you suggest peeking at each of
them before passing them to a tool like javac, or just a subset, and how
should this subset be determined? What should be done when the charset
detection reports different encodings for the set of files to process? Will
the charset detection happen over and over again for each plugin (javac,
javadoc, jxr)? What do you consider "most of the time"? Telling the various
ISO-8859 families apart is not really easy. My impression is that usage of
JChardet will significantly increase code complexity without giving me a
solid build.

Also, I believe it's a bad idea to free users from worrying about the
encoding. This would be similar to the doubtful magic the JRE provides with
its default encoding: It encourages developers to ignore the encoding issue,
leading to platform-dependent behavior. Platform-dependent Java code is a
bad practice and Maven, as far as I heard, aims at promoting best practices.
File encoding is a parameter affecting your build output just like the
source/target settings used for the compiler and hence should be explicitly
controlled.

As we talk about it: What is the agreed file encoding for the Maven sources
(MNGSITE-46)?


Benjamin



Re: [VOTE] POM Element for Source File Encoding

2008-04-08 Thread Milos Kleint

+1 on Benjamin's objections to detection.
It will slow down the build (possibly significantly) while providing
little added value.

Milos


Re: [VOTE] POM Element for Source File Encoding



On 8-Apr-08, at 11:27 AM, Benjamin Bentmann wrote:

Jason van Zyl wrote:

If it's right most of the time, and it saves the user from having to know
or worry about it then yes I would use it.


Could you elaborate this a little more. Say we start easy and have a build
with just about 100 Java source files. Do you suggest to peek at each of
them before passing them to a tool like javac or just a subset and how
should this subset be determined?


It would be reasonable to assume the detection could be based on a
subset. For an organization on one project you could reasonably assume
the same encoding. That would not be the case in an open source
project as tools would vary.



What should be done when the charset
detection reports different encodings for the set of files to process?


What happens when the encoding is different than what is stated? Same
problem really, in how to deal with the actual versus declared.



Will the charset detection happen over and over again for each plugin
(javac, javadoc, jxr)? What do you consider most of time, telling the
various ISO-8859 families apart is not really easy. My impression is that
usage of JChardet will significantly increase code complexity without
giving me a solid build.


That would depend on what kinds of problems can arise if things are  
not consistent.





Also, I believe it's a bad idea to free users from worrying about the
encoding.


You have to deal with the very real possibility that users are not going to
set it, will not know what it is, and will report issues related to encoding
even if the whole system works.


I'm all for literal and declarative. In practice this does not happen  
all the time. I also didn't say use one over the other, but the  
detection may help in cases where it's not stated. The JChardet  
library was created for a reason, and this looks like one of them.


For the system you are proposing there would be touch points at which
you would look for encoding parameters. If those values are not stated,
you will need a strategy to detect them, or you will never be able to
support any encoding alignment in older versions of Maven without the
encoding parameterization.



This would be similar to the doubtful magic the JRE provides with
its default encoding: It encourages developers to ignore the encoding
issue, leading to platform-dependent behavior. Platform-dependent Java
code is a bad practice and Maven, as far as I heard, aims at promoting
best practices.


Of course it is, but that doesn't negate the fact that people don't
necessarily follow best practices. But you are


1) going to need to deal with versions of Maven that don't support  
this encoding parameterization, and

2) you're going to have to deal with the case where it's stated wrong

We should know combinations of encoding parameter that will work  
together and if they aren't stated, or stated wrong it's better to  
provide some fallback instead of just dying.




File encoding is a parameter affecting your build output just like the
source/target settings used for the compiler and hence should be
explicitly controlled.



Absolutely, but look at all the questions on the mailing list that  
expect many of these things to just be detected. People using Java 1.5  
just expect you to be able to compile 1.5 code. That's not the case.  
Users in this case expect the right thing to happen.
I'm willing to bet you if you asked the average user about encoding,  
they would have no clue and wonder why it wasn't detected.


It was a suggestion based on experience of typical users.




As we talk about it: What is the agreed file encoding for the Maven
sources (MNGSITE-46)?


Benjamin





Thanks,

Jason


Re: [VOTE] POM Element for Source File Encoding



On 8-Apr-08, at 11:11 AM, Milos Kleint wrote:

+1 on Benjamin's objections to detection.
It will slow down the build (possibly significantly) while providing
little added value.


Possibly, but you're guessing.

Obviously checking the encoding on every file would be unwise. Trying  
to detect where it's not provided (mistakes), or can't be provided  
(not supported as an option in the model) you're going to have to do  
something. So what are you going to do in those cases?








Thanks,

Jason


Re: [VOTE] POM Element for Source File Encoding

2008-04-08 Thread Martin von Gagern


+1 for the original proposal, if a newcomer like me is allowed to vote.

The concept with the property, which can be set with the properties 
until the model is updated, and which can be the default expression for 
affected plugins, is simply elegant.


Jason van Zyl wrote:
It would be reasonable to assume the detection could be based on a
subset. For an organization on one project you could reasonably assume
the same encoding. That would not be the case in an open source project
as tools would vary.


Suppose you have a huge source tree, mostly English ASCII, but somewhere 
in there there is a single degree sign, '\u00b0'. How would you detect 
it, short of scanning every ASCII file until you hit that one?


I support concerns here that the cost of encoding detection may in many 
cases be prohibitively high. Maven runs too slow as it is, imho. You 
could of course write an encoding detection plugin which could examine 
the code and set the required property accordingly. But enabling that by 
default feels bad to me.


What happens when the encoding is different than what is stated? Same
problem really, in how to deal with the actual versus declared.


Up to the plugins, I guess, as it is now. No change there, only a 
central place to set defaults for all plugins. Of course you could write 
an encoding checking plugin which ensures that your sources are valid in 
the specified encoding.



My impression is that usage of
JChardet will significantly increase code complexity without giving me a
solid build.


That would depend on what kinds of problems can arise if things are not 
consistent.


There are three possible cases:
1. code agrees with setting => all right
2. code disagrees with setting, but is still valid under specified
   encoding => Mojibake
3. code is invalid under specified encoding => exception or unmappable
   character symbol, depending on context. Exception may be handled by plugin.


By specifying ISO-8859-1 as default input encoding, there are no 
unmappable characters, avoiding case 3. All input should be readable, 
though the output generated from this might not look as expected.
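
The three cases above can be sketched in Java with a strict decoder from java.nio.charset. This is only an illustration under assumptions: the class and method names (DeclaredVersusActual, strictDecode, classify) are hypothetical and not part of Maven or any plugin, and a real plugin would not know the intended text in advance.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of "declared versus actual" encoding; not a Maven API. */
final class DeclaredVersusActual {

    /** Decodes strictly: malformed or unmappable input throws instead of being replaced. */
    static String strictDecode(byte[] bytes, Charset declared)
            throws CharacterCodingException {
        return declared.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Maps a decoding attempt onto the three cases from the list above. */
    static String classify(byte[] bytes, Charset declared, String intended) {
        try {
            String decoded = strictDecode(bytes, declared);
            return decoded.equals(intended) ? "all right" : "mojibake";
        } catch (CharacterCodingException e) {
            return "invalid";
        }
    }

    public static void main(String[] args) {
        String intended = "20\u00b0C"; // contains a single degree sign
        byte[] utf8 = intended.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = intended.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(classify(utf8, StandardCharsets.UTF_8, intended));      // case 1
        System.out.println(classify(utf8, StandardCharsets.ISO_8859_1, intended)); // case 2
        System.out.println(classify(latin1, StandardCharsets.UTF_8, intended));    // case 3
    }
}
```

Since ISO-8859-1 assigns a character to every byte value, declaring it never reaches the "invalid" branch, which is the point made about the proposed default.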


It should be noted that plugins that generate code to be used by other 
plugins should have their output encoding default to the general input 
encoding, so that there are no breaks in the chain.


As Jason writes about consistency, I guess the danger of inconsistent 
input handling, as different plugins might be configured to read it 
using different charsets, is exactly the kind of inconsistency to be 
addressed by this proposal, so I'd expect more consistency after it has 
been implemented, not less.


Greetings,
 Martin von Gagern





Re: [VOTE] POM Element for Source File Encoding

2008-04-08 Thread Hervé BOUTEMY

Le mardi 08 avril 2008, Paul Benedict a écrit :
 In Commons Validator, we updated the DTD even in point releases. I don't
 see the harm in doing the same here. After all, if the POM is 4.0.0, why
 not create a 4.0.1? It sounds like Maven 2.1 will have a 4.1 version.

 Paul
because if you use 4.0.1 for your project, and upload your component to a 
repository, everybody depending on your component will need to support 4.0.1 
or they'll get a failure parsing a 4.0.1 pom with their Maven runtime 
supporting only 4.0.0 pom

to support a 4.1 version, I imagine there will be some trick to implement to 
upload simultaneously the original 4.1 pom version to the repository and a 
generated 4.0.0 for compatibility with Maven 2.0.x

Hervé


 On Mon, Apr 7, 2008 at 6:03 PM, Jason van Zyl [EMAIL PROTECTED] wrote:
  On 7-Apr-08, at 3:58 PM, Jason van Zyl wrote:
   Would being able to detect the encoding help with making this less
   complicated. Something JChardet?
 
  Sorry, something like JChardet:
 
  http://jchardet.sourceforge.net/
 
   On 7-Apr-08, at 2:31 PM, Hervé BOUTEMY wrote:
Le dimanche 06 avril 2008, Jason van Zyl a écrit :
 I specifically meant the core changes, but I would still recommend
 what Milos did, which was to create branches for a few of the affected
 plugins to try it all together.
   
ok, I created
http://svn.apache.org/viewvc/maven/sandbox/branches/MNG-2216/
with javadoc and jxr plugins branches to test the change, and sample
use
case.
   
 Most certainly to test new elements in
   
 the POM you need to use a branch because we still don't have a
 strategy for dealing with model changes.
   
this one is more tricky, even if the change in pom.xml is a simple
addition of
an element... Don't really know how to handle this without breaking
things
for Maven 2.0 when an artifact with this addition is deployed to a
repository.
   
 If plugins can be changed, used with the existing versions of Maven
   
 with no disruption then do it in-situ.
   
No problem here, no disruption, as proven by the test.
The only risk is that the property chosen,
${project.build.sourceEncoding},
makes users think of a new element <project><build><sourceEncoding> in
the pom, but we still don't know how we will implement it: we bet on a
solution we don't have currently.
   
Hervé
   
  
   Thanks,
  
   Jason
  
 
  Thanks,
 
  Jason
 




Re: [VOTE] POM Element for Source File Encoding

2008-04-08 Thread Hervé BOUTEMY

Le mardi 08 avril 2008, Martin von Gagern a écrit :
 +1 for the original proposal, if a newcomer like me is allowed to vote.

 The concept with the property, which can be set with the properties
 until the model is updated, and which can be the default expression for
 affected plugins, is simply elegant.
+1

 I support concerns here that the cost of encoding detection may in many
 cases be prohibitively high. Maven runs too slow as it is, imho. You
 could of course write an encoding detection plugin which could examine
 the code and set the required property accordingly. But enabling that by
 default feels bad to me.
+1
encoding detection, guessing encoding, is unreliable by nature
That is acceptable in a browser, where:
- encoding can change on every page
- a user looks at the rendered characters, sees a problem easily and fixes the 
value by simply trying another value and seeing if it is better

But embedded in Maven, where encoding is not so volatile and the consequences 
of a bad guess will be more subtle (for example as the classes compiled will 
be run and display bad output), I find it a really bad idea.

 It should be noted that plugins that generate code to be used by other
 plugins should have their output encoding default to the general input
 encoding, so that there are no breaks in the chain.
it's noted in the proposal, in the list of affected plugins (modello, for 
example, which generates Java source code)

 As Jason writes about consistency, I guess the danger of inconsistent
 input handling, as different plugins might be configured to read it
 using different charsets, is exactly the kind of inconsistency to be
 addressed by this proposal, so I'd expect more consistency after it has
 been implemented, not less.
+1

until now, few people cared about encoding for non-XML sources, and it
worked: yes, that's the magic of platform encoding (the drawback is
reproducibility)

IMHO, the best hint for a user to choose his encoding, when the default
ISO-8859-1 isn't a good value for him, is displaying the platform encoding
(in mvn -v output for example): it's easy, reliable, and corresponds to the
value he would have got before the change
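
As a rough sketch of what displaying the platform encoding could look like on the Java side, the snippet below reads the JVM's default charset and the file.encoding system property. The class name PlatformEncoding and the output format are hypothetical; this is not how mvn -v is implemented.

```java
import java.nio.charset.Charset;

/** Hypothetical helper that reports the JVM's platform encoding; not Maven code. */
final class PlatformEncoding {

    /** Builds a one-line description of the encoding the JVM uses by default. */
    static String describe() {
        return "Default charset: " + Charset.defaultCharset().name()
                + " (file.encoding=" + System.getProperty("file.encoding") + ")";
    }

    public static void main(String[] args) {
        // The value differs from machine to machine, which is exactly
        // the reproducibility drawback discussed in this thread.
        System.out.println(describe());
    }
}
```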

Hervé


Re: [VOTE] POM Element for Source File Encoding


Martin von Gagern wrote:

if a newcomer like me is allowed to vote.


The more people participate in a discussion, the more likely is the result
to match public consensus rather than individual's preferences.


Suppose you have a huge source tree, mostly English ASCII, but somewhere
in there there is a single degree sign, '\u00b0'. How would you detect
it, short of scanning every ASCII file until you hit that one?


Exactly, if the automatic guessing should have any chance to deliver the
proper result, it's doomed to scan all the files and this is additional I/O.
Please remember, I/O is one of the most expensive operations in terms of
time, in particular with a Maven build being quite sequential.
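
To make the cost argument concrete, here is a minimal sketch of the cheapest possible check, finding the first non-ASCII byte. The class and method names are hypothetical. Even this trivial test has to read every byte of a pure-ASCII input before it can answer, and a real detector such as JChardet does more work than this.

```java
/** Hypothetical illustration of why detection implies a full scan; not Maven code. */
final class NonAsciiScan {

    /**
     * Returns the index of the first byte outside the 7-bit ASCII range,
     * or -1 if the whole input is ASCII. The -1 answer is only known
     * after every byte has been examined.
     */
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0x80) != 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] ascii = "int x = 1;".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        byte[] withDegree = "temp 20\u00b0".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        System.out.println(firstNonAscii(ascii));
        System.out.println(firstNonAscii(withDegree));
    }
}
```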


You could of course write an encoding detection plugin which could examine
the code and set the required property accordingly.


Personally, I don't see the use case for that. If there are really users out
there that don't know what file encoding they are using when writing up
their sources, they are most probably happy with the proposed default value
of Latin-1. Alternatively, this encoding detection plugin could be as simple
as printing out the Java system property ${file.encoding} which obviously
worked well enough for the user.

For those users that know about file encoding, it won't be a problem to
specify this in the POM. In particular, those users will not fail to specify
the right encoding, unlike a dumb machine which merely tests whether a
particular byte stream obeys the syntax rules of an encoding.


Benjamin



Re: [VOTE] POM Element for Source File Encoding


Jason van Zyl wrote:

Possibly, but you're guessing.


Guessing about how much it will be slower, yes, guessing that it will be
slower, no. Additional work, additional time. Wouldn't you agree? Then the
question becomes: is it worth taking on this overhead, and how much benefit
do you expect from the encoding guess over the simple default value?



Obviously checking the encoding on every file would be unwise.


As Martin nicely illustrated, you would have to do exactly this. Otherwise,
you could simply shortcut the detection to ASCII because that's what you see
most of the time. The characters that require the proper encoding are in the
minority. My passion for this proposal is not about "works most of the
time"; I would like to see "works always".


Trying to detect where it's not provided (mistakes)


We proposed to set a default value in the super POM such that the encoding
will always be specified. To handle Maven 2.0.9-, we further proposed that
each plugin consistently falls back to this agreed default value in case it
doesn't get a value from the POM. Is there a case I am missing?


or can't be provided (not supported as an option in the model) you're
going to have to do something. So what are you going to do in those cases?


I am not sure what you mean when referring to "model". Are you referring to
a plugin that is currently not aware of the encoding issue, i.e. simply uses
the JVM's default value and does not provide a configuration parameter to
the user? For this case, we should simply fix this plugin and release a new
version of it to deliver consistently high quality software.


Benjamin



Re: [VOTE] POM Element for Source File Encoding

2008-04-08 Thread Paul Benedict

Herve,

Just a proposal: Maven could loosen its parsing rules when it detects
versions greater than it is configured to accept. This can't be without
limits, of course, perhaps in the range of a single point release: 4.0 <=
4.0.x < 4.1. But perhaps within the 4.0.x series, it would accept undeclared
elements instead of strict parsing against the XSD. So if a 4.0.0 parser is
given a 4.0.1 POM, it must at least match 4.0.0 but also accepts undeclared
elements.
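
A rough sketch of the acceptance rule described above, assuming model versions are plain dotted numbers. The class ModelVersionCheck and its method are hypothetical and only illustrate the proposed range check; Maven's actual model validation is not shown here.

```java
/** Hypothetical sketch of the proposed lenient version check; not Maven's validator. */
final class ModelVersionCheck {

    /**
     * Accepts a POM model version when it shares the parser's major.minor
     * prefix, i.e. a 4.0.0 parser would accept any 4.0.x but reject 4.1.
     */
    static boolean lenientlyAccepts(String parserVersion, String pomVersion) {
        String[] parser = parserVersion.split("\\.");
        String[] pom = pomVersion.split("\\.");
        return parser.length >= 2 && pom.length >= 2
                && parser[0].equals(pom[0])
                && parser[1].equals(pom[1]);
    }

    public static void main(String[] args) {
        System.out.println(lenientlyAccepts("4.0.0", "4.0.1"));
        System.out.println(lenientlyAccepts("4.0.0", "4.1.0"));
    }
}
```

The check only decides whether to switch to lenient parsing; which undeclared elements to tolerate would still have to be decided separately.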

Forward compatibility would be nice.

Paul

On Tue, Apr 8, 2008 at 4:16 PM, Hervé BOUTEMY [EMAIL PROTECTED] wrote:

 Le mardi 08 avril 2008, Paul Benedict a écrit :
  In Commons Validator, we updated the DTD even in point releases. I don't
  see the harm in doing the same here. After all, if the POM is 4.0.0, why
  not create a 4.0.1? It sounds like Maven 2.1 will have a 4.1 version.
 
  Paul



 because if you use 4.0.1 for your project, and upload your component to a
 repository, everybody depending on your component will need to support
 4.0.1
 or they'll get a failure parsing a 4.0.1 pom with their Maven runtime
 supporting only 4.0.0 pom

 to support a 4.1 version, I imagine there will be some trick to implement
 to
 upload simultaneously the original 4.1 pom version to the repository and a
 generated 4.0.0 for compatibility with Maven 2.0.x

 Hervé

Re: [VOTE] POM Element for Source File Encoding


Jason van Zyl wrote:
What happens when the encoding is different than what is stated? Same
problem really, in how to deal with the actual versus declared.


If the declared encoding does not match the actual one, I simply call this
a user error. Either the user explicitly set the wrong value or forgot to
override the default value. With regard to user errors, my general
suggestion is to fail the build. This unforgiving attitude should not be
that unfamiliar to users: It has been chosen for a popular format like XML
which is also employed by Maven for a few files.


That would depend on what kinds of problems can arise if things are  not 
consistent.


The problems depend on the encodings: If one feeds Latin-1 into a UTF-8
decoder, one will most likely encounter invalid byte sequences, making the
decoder fail. That's my favorite case as it clearly shows the user something
is wrong and needs his attention. The other case is worse because it is more
subtle: Feeding UTF-8 into a Latin-1 decoder will pass but produces output
that only a human can recognize as garbage by closely analyzing the few
non-ASCII characters.


You have to deal with the very real possibility that users are not going to
set it, will not know what it is, and will report issues related to encoding
even if the whole system works.


I don't think that lack of knowledge is a state that should be supported.
Java is an international platform, designed for platform-independence (more
or less). If developers don't know about file encoding, they are likely
producing bad code. Therefore, I am comfortable saying: Have users report
issues about encoding and let's tell them how to do it properly, i.e. teach
them another best practice. Then, maybe some day, we won't ever face programs
that were written without file encoding in mind ;-)


For the system you are proposing there would be touch points at which you
would look for encoding parameters. If those values are not stated, you
will need a strategy to detect them, or you will never be able to support
any encoding alignment in older versions of Maven without the encoding
parameterization.


Hm, maybe we talk a lot just because we didn't illustrate our proposal 
properly: A key point is that there will *always* be a specific encoding 
value. The proposal expects all affected plugins to fall back to Latin-1 (or 
whatever, just a fixed value) if they don't get an explicit setting from the 
POM. I.e. once a user employs a particular version of a plugin, he can 
immediately tell which encoding it will use to process text files. In other 
words, he can immediately tell whether the plugin will behave correctly. In 
contrast, if we followed your suggestion with encoding guessing, the user 
would have to try out the plugin and verify that it guessed correctly. The 
encoding parameterization is primarily a task for the individual plugins and 
not bound to a Maven version. Having a dedicated POM property/element is 
just sugar, not a requirement. The important aspect is unification of 
encoding handling in the plugins.
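
The fallback chain described here could be sketched as follows. Everything in this block is an assumption for illustration: the class EncodingResolver, its parameters, and the order "explicit plugin parameter, then POM property, then fixed default" are one reading of the proposal, not existing Maven or plugin code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing the proposed fallback; not part of any Maven API. */
final class EncodingResolver {

    /** Fixed fallback named in the proposal (Latin-1), independent of the JVM. */
    static final Charset FALLBACK = StandardCharsets.ISO_8859_1;

    /**
     * Resolves the encoding a plugin should use: its own parameter first,
     * then the value of ${project.build.sourceEncoding}, then the fixed
     * fallback. An unknown charset name makes Charset.forName throw,
     * which corresponds to failing the build on a user error.
     */
    static Charset resolve(String pluginParameter, String pomProperty) {
        if (pluginParameter != null && !pluginParameter.trim().isEmpty()) {
            return Charset.forName(pluginParameter.trim());
        }
        if (pomProperty != null && !pomProperty.trim().isEmpty()) {
            return Charset.forName(pomProperty.trim());
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "UTF-8"));
        System.out.println(resolve(null, null));
    }
}
```

Note that the platform default never appears in this chain, so the result is the same on every machine.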


Of course it is, but that doesn't negate the fact that people don't
necessarily follow best practices.


That's right. But I believe we have to distinguish bad practice and mistake.
What people call good practice might be controversial, but stating that a
Latin-1 encoded file should be read using UTF-8 is in general just wrong and
leaves no room for discussion. Hence I believe that Maven has every right to
fail the build and report an error if a user does not properly set up the
file encoding, forcing users to fix the error.


Absolutely, but look at all the questions on the mailing list that  expect 
many of these things to just be detected.


I don't want to upset those users but I believe that not every request is
justified, and a request can be rejected as long as the rejection is backed
by a reasonable argument. Until somebody shows me a feasible and *reliable*
algo to tell ISO-8859-1 and ISO-8859-15 apart, I don't want the dumb machine
to start guessing. I, and I hope all the other users, aim for a correct build
and if the machine cannot derive the required parameters, it is the user's
duty to specify the proper values. Besides, this is nothing that really hurts
much: add the line to your POM and be fine for the rest of your life.
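
A small illustration of why ISO-8859-1 and ISO-8859-15 are hard to tell apart: the same single byte is valid in both and merely means a different character. The class name is hypothetical, and the sketch assumes the running JVM ships the ISO-8859-15 charset, which the Java specification does not guarantee.

```java
import java.nio.charset.Charset;

/** Hypothetical demo: one byte, two valid interpretations. Not Maven code. */
final class Latin1VersusLatin9 {

    /** Decodes the bytes with the named charset (assumed to be available). */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] ambiguous = { (byte) 0xA4 };

        // Both decodings succeed, so byte-level syntax checks cannot
        // decide between them; only a human knows which was meant.
        String asLatin1 = decode(ambiguous, "ISO-8859-1");  // currency sign U+00A4
        String asLatin9 = decode(ambiguous, "ISO-8859-15"); // euro sign U+20AC

        System.out.println(Integer.toHexString(asLatin1.charAt(0)));
        System.out.println(Integer.toHexString(asLatin9.charAt(0)));
    }
}
```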



Benjamin 




Re: [VOTE] POM Element for Source File Encoding

IMHO, the best hint for a user to choose his encoding, when the default
ISO-8859-1 isn't a good value for him, is displaying the platform encoding
(in mvn -v output for example): it's easy, reliable, and corresponds to the
value he would have got before the change


+1, just created MNG-3509 for this.


Benjamin 




Re: [VOTE] POM Element for Source File Encoding