[jira] [Commented] (ACCUMULO-836) Specify Charset on getBytes() call for String objects.

Christopher Tubbs (JIRA) Wed, 31 Oct 2012 14:59:13 -0700

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488278#comment-13488278
 ]


Christopher Tubbs commented on ACCUMULO-836:
--------------------------------------------

I looked over these changes and didn't see anything that would be too 
problematic. However, I did notice places where we decode bytes into a String 
using new String(byte[]) without specifying the encoding of the byte[]. In 
some cases this is inconsistent with the corresponding setter, which uses 
.getBytes(utf8). For instance, in InputFormatBase, we have 
{code:java}conf.set(PASSWORD, new String(Base64.encodeBase64(passwd)));{code} 
Aside from the problematic fact that this Base64 library encodes to a byte[] 
instead of to a String, it doesn't document that these bytes are ASCII 
encoded. If the user's system has a default encoding that is incompatible with 
ASCII, this constructor may behave unexpectedly, for example by silently 
producing the wrong characters, as it decodes the ASCII bytes into the Java 
String type. Reading the password has no such problem: if the password is 
"Stringified" into the job configuration without error, then calling 
getBytes(utf8) on the ASCII characters in the Java String should not fail. 
While this is unlikely to cause a problem in the overwhelming majority of 
cases, it seems inconsistent to be pedantic about encoding with .getBytes() 
when we aren't equally pedantic about decoding with new String(byte[]) and 
similar.
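
To illustrate the symmetry described above, here is a minimal sketch that 
names the charset on both the decode and the encode side. It is not the 
InputFormatBase code: the PasswordCodec class and its method names are 
hypothetical, and it assumes java.util.Base64 and StandardCharsets from a 
recent JDK in place of the Base64 library used in the project, so that the 
example is self-contained. On older JDKs, Charset.forName("UTF-8") would 
serve in place of StandardCharsets.UTF_8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, for illustration only.
class PasswordCodec {

    // Write side: Base64 yields ASCII bytes, so decode them into a String
    // with an explicit charset instead of the platform default.
    static String encodePassword(byte[] passwd) {
        byte[] base64 = Base64.getEncoder().encode(passwd);
        return new String(base64, StandardCharsets.UTF_8);
    }

    // Read side: use the same explicit charset when turning the stored
    // String back into bytes before Base64-decoding it.
    static byte[] decodePassword(String stored) {
        byte[] base64 = stored.getBytes(StandardCharsets.UTF_8);
        return Base64.getDecoder().decode(base64);
    }

    public static void main(String[] args) {
        byte[] passwd = "secret".getBytes(StandardCharsets.UTF_8);
        String stored = encodePassword(passwd);
        System.out.println(stored); // prints "c2VjcmV0"
        System.out.println(Arrays.equals(passwd, decodePassword(stored)));
    }
}
```

Because both directions name the same charset, the round trip does not 
depend on the default encoding of the machine that runs it.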

So, to summarize, I think this is generally on the right path, but it needs 
equal attention to both sides, serialization and deserialization, of 
transient (M/R) and persistent (ZooKeeper) state.
                
> Specify Charset on getBytes() call for String objects.
> ------------------------------------------------------
>
>                 Key: ACCUMULO-836
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-836
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.5.0
>            Reporter: David Medinets
>            Assignee: David Medinets
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: UTF8.java
>
>
> The comments on ACCUMULO-241 indicate that the build server might have a 
> different default Charset than computers used by developers. Therefore, some 
> of the tests have different results on different computers.
> Every getBytes call on a String object should specify the UTF8 Charset. 
> Unfortunately the codebase has nearly 1,800 getBytes calls.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
