[ 
https://jira.codehaus.org/browse/MPH-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=298611#comment-298611
 ] 

Jürgen Hermann edited comment on MPH-87 at 5/14/12 6:07 AM:
------------------------------------------------------------

Writing the result to a file doesn't really help (the file's content is then 
broken, i.e. not well-formed XML). Consider this:
{code}
$ head -n1 pom.xml
<?xml version="1.0"?>

$ grep -m1 name pom.xml | xxd
0000000: 2020 2020 3c6e 616d 653e 4d75 6c74 692d      <name>Multi-
0000010: 4172 6368 6574 7970 6573 2052 6f6f 7420  Archetypes Root 
0000020: 504f 4d20 c3a4 c3b6 c3bc c39f 3c2f 6e61  POM ........</na
0000030: 6d65 3e0a                                me>.

$ MAVEN_OPTS="-Dfile.encoding=iso-8859-15" mvn -Doutput=effective.xml help:effective-pom
...
[INFO] Multi-Archetypes Root POM &#65533;&#65533;&#65533;
...

$ head -n1 effective.xml 
<?xml version="1.0" encoding="UTF-8"?>

$ xmllint effective.xml 
effective.xml:26: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0xF6 0xFC 0xDF
    <name>Multi-Archetypes Root POM &#65533;&#65533;&#65533;&#65533;</name>

$ mvn -version
Apache Maven 3.0.3 (r1075438; 2011-02-28 18:31:09+0100)
Java version: 1.6.0_26, vendor: Sun Microsystems Inc.
Default locale: en_US, platform encoding: ANSI_X3.4-1968
OS name: "linux", version: "3.0.0-12-generic-pae", arch: "i386", family: "unix"
{code}
That is, we have a pom.xml with the default encoding (UTF-8) containing some 
properly encoded umlauts (c3a4...). The Maven run (simulating a system that 
uses Latin-9) already doesn't read that correctly and emits replacement 
characters. The resulting XML is a mess: it *explicitly* states that it's 
UTF-8 while containing Latin-9.
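
The byte-level mismatch can be reproduced outside Maven. The following is a minimal Python sketch, assuming only the four characters ä, ö, ü, ß from the <name> element above; the variable names are illustrative and have nothing to do with Maven's internals.

```python
# Illustrative only: reproduces the byte sequences seen in the transcript.
text = "äöüß"

utf8_bytes = text.encode("utf-8")          # the bytes found in pom.xml
latin9_bytes = text.encode("iso-8859-15")  # the bytes xmllint complains about

print(utf8_bytes.hex())    # c3a4c3b6c3bcc39f
print(latin9_bytes.hex())  # e4f6fcdf

# A strict UTF-8 decoder rejects the Latin-9 bytes, which is what a
# conforming XML parser does when the declaration says UTF-8.
try:
    latin9_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD (the replacement character) instead.
lenient = latin9_bytes.decode("utf-8", errors="replace")
print(strict_ok, lenient.count("\ufffd"))
```

The two hex dumps match the xxd output of pom.xml and the "Bytes:" line reported by xmllint, respectively.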

In summary: Maven doesn't behave deterministically here and depends on the 
system environment where it shouldn't, leading to hard-to-find problems that 
occur "out of the blue" for some developers only.
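
One way to avoid this kind of environment dependency is to pick the output encoding explicitly and derive the XML declaration from that same value. The sketch below illustrates the idea in Python; `write_xml` is a hypothetical helper made up for this example, not the help plugin's code.

```python
import os
import tempfile


def write_xml(path, body, encoding="UTF-8"):
    """Write body with a declaration that matches the bytes actually written."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    # Passing encoding explicitly avoids falling back to the platform default.
    with open(path, "w", encoding=encoding, newline="\n") as handle:
        handle.write(declaration + body)


body = "<name>Multi-Archetypes Root POM äöüß</name>\n"
path = os.path.join(tempfile.mkdtemp(), "effective.xml")
write_xml(path, body)

with open(path, "rb") as handle:
    raw = handle.read()

# The first line is the declaration; the umlauts are stored as UTF-8 bytes.
print(raw.splitlines()[0].decode("ascii"))
print(bytes.fromhex("c3a4c3b6c3bcc39f") in raw)
```

Because the declaration and the writer share one `encoding` value, the result does not change with the locale or with `file.encoding`-style settings.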
                
> help:effective-pom uses platform encoding and garbles non-ascii characters, 
> emits invalid XML
> ---------------------------------------------------------------------------------------------
>
>                 Key: MPH-87
>                 URL: https://jira.codehaus.org/browse/MPH-87
>             Project: Maven 2.x Help Plugin
>          Issue Type: Bug
>    Affects Versions: 2.1.1
>         Environment: Windows, MacOSX, Linux, Maven 3.0.4
>            Reporter: Mirko Friedenhagen
>         Attachments: mfriedenhagen-invalidpom-MPH-87-0-g42a5c31.zip
>
>
> As stated in http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info, XML files 
> without a BOM and without an XML encoding declaration should be read as 
> UTF-8. 
> {{help:effective-pom}} uses the platform encoding for writing the 
> effective POM without emitting an appropriate XML encoding declaration in the 
> resulting XML file.
> I have created a small sample project (available at 
> https://github.com/mfriedenhagen/invalidpom, attached as ZIP) which will 
> reproduce the issue.
> While the parent pom 
> (https://raw.github.com/mfriedenhagen/invalidpom/master/pom.xml) has an XML 
> encoding declaration, 
> https://raw.github.com/mfriedenhagen/invalidpom/master/child-invalid/pom.xml 
> has none.
> Now running:
> {code}
> mvn -s settings.xml -gs settings.xml clean validate
> {code}
> will produce an invalid character for the developer name "Jörg" in 
> {{child-invalid}}. 
> Three workarounds are:
> * to include an XML encoding declaration as done in {{child-valid}}. 
> * to use {{JAVA_TOOL_OPTIONS}} on Windows as stated in 
> http://stackoverflow.com/a/623036/49132
> * to use {{MAVEN_OPTS=-Dfile.encoding=utf-8 mvn -s settings.xml -gs 
> settings.xml clean validate}}.
> Nonetheless I consider this a Major bug, as it clearly violates the 
> recommendations of W3C.
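
The rule cited in the issue description (no BOM and no declaration means UTF-8) and the first workaround can be checked with any conforming parser. Here is a small Python sketch, assuming a made-up <developer> fragment with the name "Jörg" and an ISO-8859-15 declaration chosen purely for illustration; `is_well_formed` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if the bytes parse as XML, False on a parse error."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


body = "<developer><name>Jörg</name></developer>"
declaration = '<?xml version="1.0" encoding="ISO-8859-15"?>'

# No BOM, no declaration: the parser has to assume UTF-8.
utf8_no_decl = body.encode("utf-8")
latin9_no_decl = body.encode("iso-8859-15")
# Declaration present and consistent with the bytes that follow it.
latin9_with_decl = (declaration + body).encode("iso-8859-15")

print(is_well_formed(utf8_no_decl))
print(is_well_formed(latin9_no_decl))
print(is_well_formed(latin9_with_decl))
```

Only the middle case fails: platform-encoded bytes without a matching declaration, which is the situation described for {{child-invalid}}.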


