Benjamin Bentmann wrote:
With regard to user errors, my general suggestion is to fail the build. This unforgiving attitude should not be that unfamiliar to users: it has been chosen for a popular format like XML, which Maven also uses for a few files.

The problems depend on the encodings: if one feeds Latin-1 into a UTF-8 decoder, one will most likely encounter invalid byte sequences, making the decoder fail. That's my favorite case, as it clearly shows the user that something is wrong and needs his attention. The other case is worse because it is more subtle: feeding UTF-8 into a Latin-1 decoder will pass, but produces output that only a human can recognize as garbage, by closely analyzing the few non-ASCII characters.
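
The two cases described above can be reproduced with a minimal Java sketch. The class and method names (DecodeDemo, strictDecode) are made up for illustration, and the decoder is explicitly set to report errors; this is an assumption about how a tool might read its input, not a description of what any Maven plugin actually does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Strict decode: report malformed or unmappable input instead of replacing it.
    static String strictDecode(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        String text = "Gr\u00fc\u00dfe"; // German "Gruesse" written with u-umlaut and sharp s
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Case 1: Latin-1 bytes fed into a UTF-8 decoder are rejected.
        try {
            strictDecode(latin1, StandardCharsets.UTF_8);
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("UTF-8 decoder rejected Latin-1 input: " + e);
        }

        // Case 2: UTF-8 bytes fed into a Latin-1 decoder pass, but each
        // non-ASCII character turns into two unrelated characters.
        String garbled = strictDecode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("original length " + text.length()
                + ", garbled length " + garbled.length());
    }
}
```

Every byte value is a valid Latin-1 character, which is why the second case can never fail at the decoder level.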

Taking this together, one might argue to have UTF-8 the default, not ISO-8859-1.

Almost anything that passes UTF-8 encoding constraints will indeed be UTF-8, as non-ASCII files that are not UTF-8 will almost certainly contain sequences that are not valid in UTF-8. So if a user fails to specify the encoding he uses, and if this encoding isn't UTF-8, then things will break for him. This has two advantages:

1. fail-fast behaviour. If there is a misconfiguration, the Maven run will die, and the developer can fix the issue. You don't have to wait for some other developer to complain about garbled strings, or a user to complain about a broken website, before you can fix things.

2. promote Unicode. While there are a lot of encodings out there for historic reasons, most of them suffer severe drawbacks in an international software project, because they either can't express all needed characters, or they are not common outside a small region. So while Taiwanese developers might be happy to develop an English/Chinese project in Big5, prospective American contributors might not get their editor to load files as Big5. UTF-8, on the other hand, is used worldwide and provides the whole Unicode range. For new projects, I guess UTF-8 would be a reasonable best practice, and making this best practice the default in Maven might promote it.

Of course this conflicts with previous discussions about Latin1 ensuring that any project can get compiled, as it has no invalid byte sequences. The choice is whether, in the absence of configuration,

A) you want your compile to succeed all the time, possibly generating the wrong results, or

B) you want your build to fail in case of a misconfiguration (including missing configuration), but ensure correct results if it does not fail.

If I understood him correctly, Jason voted for A). I took his request for non-dying builds as a requirement and pointed out that this is possible with Latin1. Now that I think about it, I believe I would rather want B), as I'm all for fail-fast, deterministic behaviour.

It should be checked whether plugins really die on invalid UTF-8 sequences, and what the output looks like. If possible, plugins should point out that a misconfigured encoding in the POM (either the plugin configuration or the proposed global configuration property) is a possible cause of the error, the other possible cause being a developer who saved a file in a different encoding.
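
As a rough sketch of what such a hint could look like: the helper below is hypothetical (StrictSourceReader and its decode method are not part of any Maven API), and the wording of the message is only a suggestion. One thing worth keeping in mind when checking plugins is that Java's new String(bytes, charset) and InputStreamReader do not fail on malformed input; they substitute the replacement character U+FFFD, so a plugin has to ask for strict decoding explicitly to get the fail-fast behaviour.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class StrictSourceReader {
    // Hypothetical helper: decodes file content strictly and, on failure,
    // names the two likely causes (POM configuration or the file itself).
    static String decode(String fileName, byte[] content, Charset encoding) throws IOException {
        try {
            return encoding.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IOException(fileName + " is not valid " + encoding.name()
                    + ". Either the encoding configured in the POM is wrong,"
                    + " or the file was saved in a different encoding.", e);
        }
    }
}
```

A caller would pass the bytes it read from disk together with the configured encoding, and let the IOException fail the build.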

Note that ASCII-only sources will compile cleanly no matter the default encoding, so all projects that don't need to worry about encoding won't be forced to do so. Only international projects where encoding is relevant will force their developers to either follow best practices or explicitly state their policy.
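
To illustrate the ASCII point for the two defaults under discussion, here is a small sketch; AsciiCheck and its methods are names invented for this example. It only compares UTF-8 and ISO-8859-1, which both agree with ASCII on bytes below 0x80; encodings that are not ASCII-compatible (UTF-16, for example) are outside what it shows.

```java
import java.nio.charset.StandardCharsets;

class AsciiCheck {
    // True if every byte is in the 7-bit ASCII range (0x00 to 0x7F).
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) { // bytes >= 0x80 are negative as signed Java bytes
                return false;
            }
        }
        return true;
    }

    // True if the bytes decode to the same text under both candidate defaults.
    static boolean sameInUtf8AndLatin1(byte[] bytes) {
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        return asUtf8.equals(asLatin1);
    }
}
```

For any input where isPureAscii returns true, sameInUtf8AndLatin1 returns true as well, so the choice of default makes no difference for such files.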

Greetings,
 Martin
