+1 for the original proposal, if a newcomer like me is allowed to vote.

The concept is simply elegant: a property that can be set in the properties section until the model is updated, and that serves as the default expression for affected plugins.

Jason van Zyl wrote:
It would be reasonable to assume the detection could be based on a subset. For an organization on one project you could reasonably assume the same encoding. That would not be the case in an open source project as tools would vary.

Suppose you have a huge source tree, mostly English ASCII, but somewhere in it there is a single degree sign, '\u00b0'. How would you detect it, short of scanning every ASCII file until you hit that one?

I share the concern raised here that the cost of encoding detection may in many cases be prohibitively high. Maven already runs too slowly as it is, IMHO. You could of course write an encoding detection plugin that examines the code and sets the required property accordingly, but enabling that by default feels wrong to me.
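To make the cost argument concrete, here is a minimal sketch of the cheapest possible check: finding the first non-ASCII byte in a file's contents. The class and method names are made up for illustration. Note that a pure-ASCII file can only be recognized as such by reading it to the end, which is why detection over a whole source tree touches nearly every byte.

```java
final class AsciiScan {
    /**
     * Returns the index of the first byte outside the 7-bit ASCII range,
     * or -1 if the data is pure ASCII. Hypothetical helper, not part of Maven.
     */
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            // Bytes with the high bit set cannot be ASCII.
            if ((data[i] & 0x80) != 0) {
                return i;
            }
        }
        // Reaching this point means every single byte was inspected.
        return -1;
    }
}
```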

What happens when the encoding is different than what is stated? Same problem really, in how to deal with the actual versus declared.

Up to the plugins, I guess, as it is now. No change there, only a central place to set defaults for all plugins. Of course you could write an encoding checking plugin which ensures that your sources are valid in the specified encoding.
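The core of such an encoding checking plugin could be quite small. The following is only a sketch of the validation step, assuming the file contents have already been read into a byte array; the class and method names are hypothetical, and wiring it into a mojo is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class EncodingCheck {
    /**
     * Returns true if the bytes decode without error under the declared charset.
     * A strict decoder is used so that malformed input is reported, not replaced.
     */
    static boolean isValid(byte[] bytes, Charset declared) {
        try {
            declared.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A check like this can only prove that the sources are valid in the declared encoding, not that the declared encoding is the intended one.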

My impression is that usage of JChardet will significantly increase code complexity without giving me a solid build.

That would depend on what kinds of problems can arise if things are not consistent.

There are three possible cases:
1. code agrees with setting => all right
2. code disagrees with setting, but is still valid under specified encoding => Mojibake
3. code is invalid under specified encoding => exception or unmappable character symbol, depending on context. The exception may be handled by the plugin.
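The three cases can be reproduced with the degree sign from above. This is an illustrative sketch with made-up names: it decodes strictly and compares against the text the author intended, which a real plugin would of course not know.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ThreeCases {
    enum Outcome { OK, MOJIBAKE, INVALID }

    /** Classifies what happens when bytes are read under the declared charset. */
    static Outcome classify(byte[] bytes, Charset declared, String intended) {
        try {
            String decoded = declared.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            // Decoding succeeded: either the right text (case 1) or garbage (case 2).
            return decoded.equals(intended) ? Outcome.OK : Outcome.MOJIBAKE;
        } catch (CharacterCodingException e) {
            // Case 3: the bytes are not valid under the declared charset.
            return Outcome.INVALID;
        }
    }

    public static void main(String[] args) {
        String degree = "\u00b0";
        byte[] utf8 = degree.getBytes(StandardCharsets.UTF_8);        // bytes C2 B0
        byte[] latin1 = degree.getBytes(StandardCharsets.ISO_8859_1); // byte B0

        System.out.println(classify(utf8, StandardCharsets.UTF_8, degree));      // case 1
        System.out.println(classify(utf8, StandardCharsets.ISO_8859_1, degree)); // case 2
        System.out.println(classify(latin1, StandardCharsets.UTF_8, degree));    // case 3
    }
}
```

With a lenient decoder such as `new String(bytes, charset)`, case 3 does not throw; the malformed byte is replaced by U+FFFD instead, which is the "unmappable character symbol" variant.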

With ISO-8859-1 as the default input encoding there are no unmappable characters, since every byte value maps to a character, which avoids case 3. All input is readable, though the output generated from it might not look as expected.
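The claim about ISO-8859-1 is easy to check in Java: all 256 byte values decode to a character and encode back to the same byte. A small sketch, with a hypothetical class name:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1RoundTrip {
    /** Returns true if every possible byte value survives decode and re-encode. */
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        // Each byte becomes exactly one char, so nothing is lost or replaced.
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        byte[] encoded = decoded.getBytes(StandardCharsets.ISO_8859_1);
        return decoded.length() == all.length && Arrays.equals(all, encoded);
    }
}
```

This lossless round trip is what makes ISO-8859-1 a safe fallback for reading: a plugin that merely copies or filters text will not corrupt bytes it does not understand, even when the result displays as Mojibake.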

It should be noted that plugins that generate code to be used by other plugins should have their output encoding default to the general input encoding, so that there are no breaks in the chain.

Jason writes about consistency. I think the danger of inconsistent input handling, where different plugins are configured to read the same sources using different charsets, is exactly the kind of inconsistency this proposal addresses. So I'd expect more consistency once it has been implemented, not less.

Greetings,
 Martin von Gagern

