Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Martin von Gagern

Benjamin Bentmann wrote:
In general, I completely agree with your preference for Unicode and
fail-fast behavior. If I had been involved when the Maven story started,
I would have proposed UTF-8 as the default value, no doubt.

As for today, I tried to consider consistency with existing behavior. The
Maven Site Plugin was already using Latin-1 as the default value for
inputEncoding and outputEncoding and so I proposed this for other plugins,
too. Indeed, one of the patches (MJAVADOC-165) was just released such that
already two plugins teach users this default value. Therefore I fear it
might be too late to introduce another default value. If the community
believes this change is worth the confusion caused on users, I'm the first
one running the other way round ;-)


I see your point. Worth another vote? Or should this switch be postponed 
to 2.1, trading consistency in minor version upgrades for a longer time 
for these Latin1 defaults to be established?


Given the failfast nature of the UTF-8 default, we won't have to worry 
about the switch going unnoticed. Developers switching from a version 
defaulting to Latin1 to UTF-8 will notice the change immediately, and 
for development in a heterogeneous environment they can simply override 
the super-POM with their own default.


So while I agree that a change in default either now or in the future is 
ugly, it is not taboo, and I believe it is worth the gain.



That's a good point. It appears we need to do some extra homework here: The
simplistic use of InputStreamReader and OutputStreamWriter will silently
convert unmappable byte sequences to a default character ('?', see also
[0]). I guess we could nicely hide the required implementation by means of
the existing methods in Reader-/WriterFactory from plexus-utils.


That works for plugins doing the conversion in code under our control. 
Other plugins that use external libraries or tools might be more difficult.
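
To make the difference concrete, here is a minimal sketch using only the JDK. It is not the plexus-utils ReaderFactory API (whose method names are not assumed here), and the class and method names are made up for illustration. It contrasts a reader that silently substitutes a replacement character with one that reports malformed input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Maven or plexus-utils.
final class StrictReaders {

    // Lenient: malformed input is silently replaced (with U+FFFD when reading).
    static Reader lenient(InputStream in, Charset cs) {
        return new InputStreamReader(in, cs);
    }

    // Strict: malformed or unmappable input raises a CharacterCodingException.
    static Reader strict(InputStream in, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new InputStreamReader(in, decoder);
    }

    // Reads the whole stream; returns null if the decoder rejected the input.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (CharacterCodingException e) {
            return null;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xE9 is e-acute in ISO-8859-1 but starts an incomplete sequence in UTF-8.
        byte[] latin1 = { 'a', (byte) 0xE9, 'b', 'c', 'd' };
        Charset utf8 = StandardCharsets.UTF_8;
        System.out.println(readAll(lenient(new ByteArrayInputStream(latin1), utf8)));
        System.out.println(readAll(strict(new ByteArrayInputStream(latin1), utf8)));
    }
}
```

A plugin that reads through something like strict() would fail the build on the first bad byte sequence instead of passing garbage downstream.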



Note that ASCII-only sources will compile cleanly no matter the default
encoding


Most of the time, but UTF-16 and EBCDIC do not even have ASCII in common.


I was thinking about the default of the default, i.e. the value to be 
set in the super-POM. We certainly won't choose UTF-16 or EBCDIC for 
this global default, and as files encoded in UTF-16 or EBCDIC don't 
count as ASCII-only, my


 Martin



signature.asc
Description: OpenPGP digital signature


Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Martin von Gagern

Benjamin Bentmann wrote:
With regard to user errors, my general 
suggestion is to fail the build. This unforgiving attitude should not be 
that unfamiliar to users: It has been chosen for a popular format like 
XML which is also employed by Maven for a few files.


The problems depend on the encodings: If one feeds Latin-1 into a UTF-8 
decoder, one will most likely encounter invalid byte sequences, making the 
decoder fail. That's my favorite case as it clearly shows the user that 
something is wrong and needs his attention. The other case is worse 
because it is more subtle: Feeding UTF-8 into a Latin-1 decoder will pass 
but produce output that only a human can recognize as garbage by closely 
analyzing the few non-ASCII characters.
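
The two cases can be reproduced with a few lines of JDK code. This is only an illustrative sketch; the class name and the sample string are made up. The decoder returned by Charset.newDecoder() reports errors by default, so it behaves like the strict case described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class DecodeCases {

    // Strictly decodes the bytes; returns null if they are not valid in cs.
    static String strictDecode(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "cafe" with an e-acute at the end
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1); // 63 61 66 E9
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);        // 63 61 66 C3 A9

        // Latin-1 bytes through a UTF-8 decoder: rejected, the error is visible.
        System.out.println(strictDecode(latin1, StandardCharsets.UTF_8) == null);

        // UTF-8 bytes through a Latin-1 decoder: accepted, but one character
        // has silently become two (U+00C3 U+00A9).
        String garbled = strictDecode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());
    }
}
```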


Taking this together, one might argue to have UTF-8 the default, not 
ISO-8859-1.


Almost anything that passes UTF-8 encoding constraints will be indeed 
UTF-8, as non-ASCII files that are not UTF-8 will almost certainly 
contain sequences not valid in UTF-8. So if a user fails to specify the 
encoding he uses, and if this encoding isn't UTF-8, then things will 
break for him. This has two advantages:


1. fail-fast behaviour. If there is a misconfiguration, the maven run 
will die, and the developer can fix the issue. You don't have to wait 
for some other developer complaining about garbled strings or a user 
complaining about a broken website until you can fix things.


2. promote Unicode. While there are a lot of encodings out there for 
historic reasons, most of them suffer severe drawbacks in an 
international software project, because they either can't express all 
needed characters, or they are not common outside a small region. So 
while Taiwanese developers might be happy to develop an English/Chinese 
project in Big5, prospective American contributors might not get their 
editor to load files as Big5. UTF-8, on the other hand, is used 
worldwide and provides the whole Unicode range.
For new projects, I guess UTF-8 would be a reasonable best practice, and 
making this best practice the default in maven might promote it.


Of course this conflicts with previous discussions about Latin1 ensuring 
that any project can get compiled, as it has no invalid byte sequences. 
The choice is whether, in the absence of configuration,


A) you want your compile to succeed all the time, possibly generating 
the wrong results, or


B) you want your build to fail in case of a misconfiguration (including 
missing configuration), but ensure correct results if it does not fail.


If I understood him correctly, Jason voted for A). I took his request 
for non-dying builds as a requirement and pointed out that this is 
possible with Latin1. Now that I think about it, I believe I would 
rather want B), as I'm all for failfast deterministic behaviour.


It should be checked whether plugins really die for invalid UTF-8 
sequences, and what the output looks like. If possible, plugins should 
point out that a misconfiguration of the encoding in the pom (either the 
plugin configuration or the proposed global configuration property) is 
possibly the cause of the error, if it's not a developer using another 
encoding.


Note that ASCII-only sources will compile cleanly no matter the default 
encoding, so all projects that don't need to worry about encoding won't 
be forced to do so. Only international projects where encoding is 
relevant will force their developers to either follow best practices or 
explicitly state their policy.
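
A small sketch of why ASCII-only sources are unaffected, assuming the default is one of the ASCII-compatible candidates discussed in this thread (UTF-8 or ISO-8859-1). The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class AsciiCompat {

    // True if every byte is below 0x80. Java bytes are signed, so any byte
    // with the high bit set shows up as a negative value.
    static boolean isAsciiOnly(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    // True if UTF-8, ISO-8859-1 and US-ASCII all decode the bytes to the same text.
    static boolean sameUnderCommonDefaults(byte[] bytes) {
        String utf8 = new String(bytes, StandardCharsets.UTF_8);
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        String ascii = new String(bytes, StandardCharsets.US_ASCII);
        return utf8.equals(latin1) && latin1.equals(ascii);
    }

    public static void main(String[] args) {
        byte[] plain = "int x = 1;".getBytes(StandardCharsets.US_ASCII);
        byte[] degree = { 'x', (byte) 0xB0 }; // degree sign in ISO-8859-1

        System.out.println(isAsciiOnly(plain) + " " + sameUnderCommonDefaults(plain));
        System.out.println(isAsciiOnly(degree) + " " + sameUnderCommonDefaults(degree));
    }
}
```

UTF-16 or EBCDIC would break this property, which is why they are not candidates for the global default.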


Greetings,
 Martin





Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Martin von Gagern

Paul Benedict wrote:

Just a proposal: Maven could loosen its parsing rules when it detects
versions greater than it is configured to accept.

Forward compatibility would be nice.


For anyone seriously interested in interoperability, I suggest a look 
at http://www.w3.org/2005/05/xsd-versioning-resources.html , especially 
the use cases, which illustrate several issues quite well.


 Martin





Re: [VOTE] POM Element for Source File Encoding

2008-04-09 Thread Martin von Gagern

Benjamin Bentmann wrote:
You could of course write an encoding detection plugin which could 
examine the code and set the required property accordingly.


Personally, I don't see the use case for that. If there are really users
out there that don't know what file encoding they are using when writing up
their sources, they are most probably happy with the proposed default value
of Latin-1. Alternatively, this encoding detection plugin could be as
simple as printing out the Java system property ${file.encoding} which
obviously worked well enough for the user.


${file.encoding} will only work if the file originated on the same machine.

I think of semi-automatic conversions of inhomogeneous code into Maven. 
E.g. some teacher collects homework from his students as a bunch of zip 
files containing only source, has a script to turn each into a maven 
project, and a master project interacting with them, like letting them 
compete with one another or whatever. In this case one might wish to 
automatically detect the encoding of every module, especially in locales 
with several commonly used encodings, so that string literals in these 
classes are handled correctly without the students even knowing what an 
encoding is.


But that's a corner case, so I guess we should stop discussion about the 
use of such a program here, until someone actually requires it.


Greetings,
 Martin





Re: [VOTE] POM Element for Source File Encoding

2008-04-08 Thread Martin von Gagern

+1 for the original proposal, if a newcomer like me is allowed to vote.

The concept with the property, which can be set with the properties 
until the model is updated, and which can be the default expression for 
affected plugins, is simply elegant.


Jason van Zyl wrote:
It would be reasonable to assume the detection could be based on a 
subset. For an organization on one project you could reasonably assume 
the same encoding. That would not be the case in an open source project 
as tools would vary.


Suppose you have a huge source tree, mostly English ASCII, but somewhere 
in there is a single degree sign, '\u00b0'. How would you detect it, 
short of scanning every ASCII file until you hit that one?


I support concerns here that the cost of encoding detection may in many 
cases be prohibitively high. Maven runs too slow as it is, imho. You 
could of course write an encoding detection plugin which could examine 
the code and set the required property accordingly. But enabling that by 
default feels bad to me.
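
To illustrate the cost argument: any detector has to do at least the following scan, and for a pure-ASCII file it must read every byte before it can say so. This is a minimal sketch with made-up names, not an actual detection plugin.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical demo class, for illustration only.
final class NonAsciiScan {

    // Returns the offset of the first byte >= 0x80, or -1 if the data is
    // pure ASCII. In the pure-ASCII case the whole array has been visited.
    static int firstNonAscii(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] < 0) { // signed byte: high bit set
                return i;
            }
        }
        return -1;
    }

    // Usage: java NonAsciiScan File1.java File2.java ...
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] content = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": " + firstNonAscii(content));
        }
    }
}
```

Only the files where this returns something other than -1 carry any information about the encoding, and in the worst case there is exactly one such file in the tree.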


What happens when the encoding is different than what is stated? Same 
problem really, in how to deal with the actual versus declared.


Up to the plugins, I guess, as it is now. No change there, only a 
central place to set defaults for all plugins. Of course you could write 
an encoding checking plugin which ensures that your sources are valid in 
the specified encoding.



My impression is that usage of
JChardet will significantly increase code complexity without giving me a
solid build.


That would depend on what kinds of problems can arise if things are not 
consistent.


There are three possible cases:
1. code agrees with setting => all right
2. code disagrees with setting, but is still valid under specified 
encoding => Mojibake
3. code is invalid under specified encoding => exception or unmappable 
character symbol, depending on context. Exception may be handled by plugin.


By specifying ISO-8859-1 as default input encoding, there are no 
unmappable characters, avoiding case 3. All input should be readable, 
though the output generated from this might not look as expected.
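
The claim that ISO-8859-1 never hits case 3 can be checked mechanically: every one of the 256 byte values is a valid character on its own. A minimal sketch with made-up names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class TotalDecoding {

    // True if a strict decoder for cs accepts each single byte value 0x00..0xFF.
    static boolean acceptsEveryByte(Charset cs) {
        for (int b = 0; b < 256; b++) {
            try {
                // newDecoder() reports malformed and unmappable input by default.
                cs.newDecoder().decode(ByteBuffer.wrap(new byte[] { (byte) b }));
            } catch (CharacterCodingException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(acceptsEveryByte(StandardCharsets.ISO_8859_1));
        System.out.println(acceptsEveryByte(StandardCharsets.UTF_8));
        System.out.println(acceptsEveryByte(StandardCharsets.US_ASCII));
    }
}
```

Since single bytes are already rejected by UTF-8 and US-ASCII, only the Latin-1 check returns true; that is the property that trades case 3 for case 2.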


It should be noted that plugins that generate code to be used by other 
plugins should have their output encoding default to the general input 
encoding, so that there are no breaks in the chain.


As Jason writes about consistency, I guess the danger of inconsistent 
input handling, as different plugins might be configured to read it 
using different charsets, is exactly the kind of inconsistency to be 
addressed by this proposal, so I'd expect more consistency after it has 
been implemented, not less.


Greetings,
 Martin von Gagern






Compiling for API compliance

2008-04-08 Thread Martin von Gagern

Hi!

I would like to compile code not only for a given class file format
version, but also to the corresponding Java API specification. There 
should be different settings for main and test code, as the main code 
should be highly portable, while test code might make use of quick and 
dirty features that only became available more recently.


I had already started a mail with this subject in users@, heard that 
what I want to do isn't possible so far, and now I want to change things 
so it becomes possible.

http://www.nabble.com/Compiling-for-API-compliance-td16538018s177.html

I'd like for my projects to simply set two variables, like 
main.java.version and test.java.version, which should affect the source 
and target version of the compiler, as well as either the jdk version 
selected, or the bootclasspath for the compiler be set accordingly. The 
latter would have the benefit that it would still use the newer 
compiler, thus allowing access to newer compiler flags, e.g. for lint, 
while still ensuring API compatibility to an older version.
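
As a rough sketch of the bootclasspath variant: the javac options -source, -target and -bootclasspath are real, but the helper class below and the rt.jar path are purely hypothetical, and this is not how the compiler plugin assembles its arguments today.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class CrossCompileArgs {

    // Builds javac arguments that compile for an older Java version with a
    // newer compiler. bootClasspath points at the older runtime classes; if
    // it is null, only the language level and class file format are pinned.
    static List<String> javacArgs(String javaVersion, String bootClasspath) {
        List<String> args = new ArrayList<String>();
        args.add("-source");
        args.add(javaVersion);
        args.add("-target");
        args.add(javaVersion);
        if (bootClasspath != null) {
            args.add("-bootclasspath");
            args.add(bootClasspath);
        }
        return args;
    }

    public static void main(String[] args) {
        // The path is an assumption; it depends on where the old JDK is installed.
        System.out.println(javacArgs("1.4", "/opt/jdk1.4/jre/lib/rt.jar"));
        System.out.println(javacArgs("1.5", null));
    }
}
```

Without the bootclasspath, -source and -target alone would still let main code call methods that only exist in the newer API.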


Originally I intended to write my own plugin, derived from 
maven-compiler-plugin, to implement this. However, I found out that the 
compiler plugin provides no access to its private fields, and hacking at 
them with reflection feels very nasty, so I'll not do that. Instead of 
maintaining my private branch of the compiler plugin, maybe this thing is 
interesting enough for people out there to warrant implementation in the 
main compiler plugin. Would you agree?


My primary concern is how to proceed with this. I could write a 
feature request for the compiler plugin in JIRA, start a wiki page on 
docs.codehaus, or continue discussion here. Or I could do several of 
these things at the same time.


There are also some open questions, which I will list below.

== 1. POM Configuration ==

How to configure this in the POM? I guess backwards compatibility should 
be preserved. The current compiler allows for <source> and <target>. I 
could think of these settings:



  expression="${maven.compiler.source}"
  default="${main.java.version}"

  expression="${maven.compiler.target}"
  default="${main.java.version}"

  expression="${main.java.version}"

  expression="${maven.compiler.source}"
  default="${test.java.version}"

  expression="${maven.compiler.target}"
  default="${test.java.version}"

  expression="${test.java.version}"

This way, all current settings should continue to work, unless someone 
used one of the newly introduced properties for some different purpose.
More complex mixing of property variables could be done in a parent 
plugin configuration, with the corresponding properties set in 
submodules as required.
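
The intended lookup order for one such parameter could be sketched as follows. This is an interpretation of the expression/default pairs above, not existing plugin code: it assumes an explicitly configured value wins, then the expression property, then the new fallback property. The class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class VersionResolution {

    // Resolves one parameter value. Assumed order: explicit plugin
    // configuration, then the expression property, then the fallback property.
    static String resolve(String configured, Map<String, String> properties,
                          String expressionKey, String fallbackKey) {
        if (configured != null) {
            return configured;
        }
        String fromExpression = properties.get(expressionKey);
        if (fromExpression != null) {
            return fromExpression;
        }
        return properties.get(fallbackKey);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<String, String>();
        props.put("main.java.version", "1.4");
        // No explicit value and no maven.compiler.source: the fallback applies.
        System.out.println(resolve(null, props, "maven.compiler.source", "main.java.version"));

        props.put("maven.compiler.source", "1.5");
        // An existing maven.compiler.source setting keeps working unchanged.
        System.out.println(resolve(null, props, "maven.compiler.source", "main.java.version"));
    }
}
```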


Do you agree that this set of configuration parameters would indeed make 
sense?


http://jira.codehaus.org/browse/MCOMPILER-15

== 2. Toolchain ==

The modified plugin would need information about what compilers are 
available, and where the corresponding executables can be located. This 
information is system-specific, not project-specific, so it should 
reside in a maven config file.


This sounds a lot like the toolchains proposal, so maybe there is some 
way to leverage that.


http://jira.codehaus.org/browse/MNG-468
http://docs.codehaus.org/display/MAVEN/Toolchains
http://docs.codehaus.org/display/MAVEN/Toolchains?showComments=true&focusedCommentId=77693099#comment-77693099

== 3. Compiler arguments ==

I only know about javac, so I'm not sure whether this way to set the 
bootclasspath would work for other compilers as well. It would be nice 
to have this handled in a consistent way in the plexus compiler manager 
component, but I guess that would mean changing quite a lot of code. 
Should this be targeted to javac only for the time being?

Does someone know about corresponding settings for other compilers?

== 4. Default behaviour ==

What should be done when the requirements cannot be met? I guess a 
warning would be a good solution. You might even want to get a hard 
error for release builds. Should there be a parameter to modify this 
behaviour?



Let me know what you think of all this, how you would suggest I proceed, 
and what other resources might be useful. I don't have too much time to 
spare for this issue, but I believe it important enough to do some work 
for it from time to time, and when I do so, I might as well do so in a 
way that others can profit from it as well.


Greetings,
 Martin von Gagern





Re: ANSI color logging in Maven

2008-04-08 Thread Martin von Gagern

James William Dumay wrote:

Rahul,
Something like this library might help you in your quest...

http://sourceforge.net/projects/javacurses/

James


CHARVA might be useful as well:
http://www.pitman.co.za/projects/charva/

It seems both require a native DLL in order to work properly. This makes 
sense for things like single character input, echo control and similar 
terminal settings.


I should assume that color output would work without curses, simply 
using the escape sequences as mentioned. So I'd keep javacurses and 
charva as fallback, or use them if available without depending on them.
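
For illustration, plain escape sequences need nothing beyond string concatenation. The sketch below uses the standard ANSI SGR codes for red, green and reset; the class name and the enabled flag are made up, and how Maven would decide whether the terminal supports color is left open.

```java
// Hypothetical helper, for illustration only.
final class AnsiColor {

    static final String RESET = "\u001B[0m";
    static final String RED = "\u001B[31m";
    static final String GREEN = "\u001B[32m";

    // Wraps text in the given color code, or returns it unchanged when
    // color output is disabled (e.g. when writing to a log file).
    static String colorize(String text, String colorCode, boolean enabled) {
        return enabled ? colorCode + text + RESET : text;
    }

    public static void main(String[] args) {
        System.out.println(colorize("BUILD SUCCESSFUL", GREEN, true));
        System.out.println(colorize("BUILD FAILED", RED, true));
        System.out.println(colorize("BUILD FAILED", RED, false));
    }
}
```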


Greetings,
 Martin

P.S.: I just recently subscribed to the list, and didn't receive the 
mail I'm responding to, so maybe this answer will break the thread in 
some views. Sorry about that.



