[ 
https://issues.apache.org/jira/browse/CODEC-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085261#comment-13085261
 ] 

Gary D. Gregory commented on CODEC-127:
---------------------------------------

Arg:
{noformat}
C:\svn\org\apache\commons\trunks-proper\codec>perl -MWild -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
Can't open */*.java: Invalid argument.
{noformat}
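
A shell-independent way to run the same check can be sketched in Java. This is only an illustration: the class name NonAsciiScanner and its method are hypothetical and not part of Commons Codec, and it assumes Java 8 or later. It walks a directory tree itself instead of relying on the shell to expand */*.java, and it reads each file as ISO-8859-1 so that every byte above 0x7F is reported whatever the file's real encoding is.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Commons Codec.
class NonAsciiScanner {

    /** Returns one "file:line text" entry for each line containing a char above 0x7F. */
    static List<String> findNonAscii(String fileName, List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.chars().anyMatch(c -> c > 0x7F)) {
                hits.add(fileName + ":" + (i + 1) + " " + line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Root directory to scan; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> sources;
        try (Stream<Path> walk = Files.walk(root)) {
            sources = walk.filter(Files::isRegularFile)
                          .filter(p -> p.toString().endsWith(".java"))
                          .collect(Collectors.toList());
        }
        for (Path source : sources) {
            // ISO-8859-1 maps each byte to exactly one char, so decoding never fails
            // and any non-ASCII byte shows up as a char above 0x7F.
            List<String> lines = Files.readAllLines(source, StandardCharsets.ISO_8859_1);
            findNonAscii(source.toString(), lines).forEach(System.out::println);
        }
    }
}
```

As with the perl one-liner, the characters printed for flagged lines are not necessarily the intended ones, because the bytes are shown through an ISO-8859-1 lens; the file names and line numbers are the useful part.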


> Non-ascii characters in source files
> ------------------------------------
>
>                 Key: CODEC-127
>                 URL: https://issues.apache.org/jira/browse/CODEC-127
>             Project: Commons Codec
>          Issue Type: Bug
>            Reporter: Sebb
>
> Some of the test cases include characters in a native encoding (possibly 
> UTF-8), rather than using Unicode escapes.
> This can cause problems for IDEs that don't know the encoding (e.g. 
> compilation errors, which is how I found the issue), and some 
> transformations, such as fixing EOLs, may corrupt the contents.
> I think we should have a rule of using Unicode escapes for all such non-ASCII 
> characters.
> It's particularly important for non-ISO-8859-1 characters.
> Some example classes with non-ASCII characters:
> {code}
> binary\Base64Test.java:96         byte[] decode = 
> b64.decode("SGVsbG{´┐¢´┐¢´┐¢´┐¢´┐¢´┐¢}8gV29ybGQ=");
> language\ColognePhoneticTest.java:110             {"m├Ânchengladbach", 
> "664645214"},
> language\ColognePhoneticTest.java:130         String[][] data = 
> {{"bergisch-gladbach", "174845214"}, {"M├╝ller-L├╝denscheidt", "65752682"}};
> language\ColognePhoneticTest.java:137             {"Meyer", "M├╝ller"},
> language\ColognePhoneticTest.java:143             {"ganz", "Gänse"},
> language\DoubleMetaphoneTest.java:1222         
> this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "S");
> language\DoubleMetaphoneTest.java:1227         
> this.getDoubleMetaphone().isDoubleMetaphoneEqual("´┐¢", "N");
> language\SoundexTest.java:367         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:369                 Assert.assertEquals("´┐¢000", 
> this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:375             Assert.assertEquals("", 
> this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:387         if (Character.isLetter('´┐¢')) {
> language\SoundexTest.java:389                 Assert.assertEquals("´┐¢000", 
> this.getSoundexEncoder().encode("´┐¢"));
> language\SoundexTest.java:395             Assert.assertEquals("", 
> this.getSoundexEncoder().encode("´┐¢"));
> {code}
> The characters are probably not correct above, because I used a crude perl 
> script to find them:
> {code}
> perl -ne "$.=1 if $s ne $ARGV;print qq($ARGV:$. $_) if m/\P{ASCII}/;$s=$ARGV;" */*.java
> {code}
> language\SoundexTest.java:367 in particular is incorrect, because it's 
> supposed to be a single character.
> Now one might think that native2ascii -encoding UTF-8 would fix that, but it 
> gives:
> if (Character.isLetter('\ufffd'))
> which is U+FFFD, the Unicode replacement ("unknown") character.
> Similarly for binary\Base64Test.java:96.
> It's not all that clear what the Unicode escapes should be in these cases, 
> but probably not the unknown character.
> [Possibly the characters got mangled at some point, or maybe they have always 
> been wrong]
> The ColognePhoneticTest.java cases are less serious, as the characters are 
> valid ISO-8859-1 (accented German), but given that the rest of the file uses 
> Unicode escapes, I think they should be changed too (with comments added to 
> say what they are, e.g. o-umlaut, u-umlaut).
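
The proposal for the German strings could look like the sketch below. The class and constant names are hypothetical; only the string values come from the issue. The decodeAsUtf8 helper also shows one plausible way a U+FFFD can appear, namely ISO-8859-1 bytes being decoded as UTF-8. That is an assumption for illustration; the issue does not establish how the SoundexTest or Base64Test characters were actually mangled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of Commons Codec.
class UnicodeEscapeDemo {

    // Escaped forms of the German test strings, with comments naming the characters.
    static final String MOENCHENGLADBACH = "m\u00F6nchengladbach"; // o-umlaut
    static final String MUELLER = "M\u00FCller";                   // u-umlaut
    static final String GAENSE = "G\u00E4nse";                     // a-umlaut

    /** Decodes bytes as UTF-8; malformed input is replaced with U+FFFD. */
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A lone 0xE4 byte is a-umlaut in ISO-8859-1 but malformed in UTF-8.
        byte[] latin1 = GAENSE.getBytes(StandardCharsets.ISO_8859_1);
        String misread = decodeAsUtf8(latin1);
        System.out.println("contains U+FFFD: " + (misread.indexOf('\uFFFD') >= 0));
        System.out.println("escaped string length: " + MUELLER.length());
    }
}
```

With escapes, the source file is pure ASCII, so it compiles identically whatever encoding an IDE or build tool assumes.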

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

