[
https://issues.apache.org/jira/browse/ACCUMULO-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488278#comment-13488278
]
Christopher Tubbs commented on ACCUMULO-836:
--------------------------------------------
So, I looked over these changes, and didn't see anything that would be too
problematic... but... I did notice that there are places where we are decoding
bytes into a String, using new String(byte[]), but not specifying the encoding
of the byte[]. This causes a discrepancy in some cases with the corresponding
read path, which uses .getBytes(utf8). For instance, in InputFormatBase, we have
{code:java}conf.set(PASSWORD, new String(Base64.encodeBase64(passwd)));{code}
Aside from the problematic fact that this Base64 library encodes to a byte[]
instead of to a String, it doesn't document the fact that these bytes are ASCII
encoded. If the user's system had a default encoding that is incompatible with
ASCII, this constructor may silently produce the wrong characters (its behavior
on bytes that are invalid in the default charset is unspecified), as it
decodes the ASCII into the Java String type. Reading the password has no such
problem... if the password is "Stringified" into the job configuration without
error, then calling getBytes(utf8) on the ASCII characters in the Java String
should not throw an exception. While this is not likely to cause a problem in
the overwhelming majority of cases, it seems inconsistent to be pedantic about
encoding with .getBytes() when we aren't equally pedantic about decoding with
new String(byte[]) and similar.
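To illustrate the symmetry being asked for, here is a minimal sketch that names
the charset on both the encode and decode side. It is not the patch itself: the
class name PasswordCodec and its methods are hypothetical, and java.util.Base64
(Java 8+) stands in for the Base64 library used in InputFormatBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, for illustration only; not part of Accumulo.
final class PasswordCodec {

    // Turn raw password bytes into a String that can be stored in a job configuration.
    static String encode(byte[] passwd) {
        byte[] base64 = Base64.getEncoder().encode(passwd);
        // Base64 output is pure ASCII, which is a subset of UTF-8, so decoding
        // the bytes as UTF-8 never depends on the platform default charset.
        return new String(base64, StandardCharsets.UTF_8);
    }

    // Recover the raw password bytes from the stored String.
    static byte[] decode(String stored) {
        // Same charset as in encode(), so both directions agree.
        byte[] base64 = stored.getBytes(StandardCharsets.UTF_8);
        return Base64.getDecoder().decode(base64);
    }

    public static void main(String[] args) {
        byte[] passwd = "s3cr\u00e9t".getBytes(StandardCharsets.UTF_8);
        String stored = encode(passwd);
        byte[] restored = decode(stored);
        System.out.println(Arrays.equals(passwd, restored)); // prints true
    }
}
```

The point of the sketch is only that new String(byte[], Charset) and
getBytes(Charset) are paired with the same charset, so the round trip does not
change with the JVM's default encoding.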
So, to summarize, I think this is generally on the right path, but needs more
focus on both sides of serialization/deserialization of transient (M/R) /
persistent(zoo) state.
> Specify Charset on getBytes() call for String objects.
> ------------------------------------------------------
>
> Key: ACCUMULO-836
> URL: https://issues.apache.org/jira/browse/ACCUMULO-836
> Project: Accumulo
> Issue Type: Improvement
> Affects Versions: 1.5.0
> Reporter: David Medinets
> Assignee: David Medinets
> Priority: Minor
> Fix For: 1.5.0
>
> Attachments: UTF8.java
>
>
> The comments on ACCUMULO-241 indicate that the build server might have a
> different default Charset than computers used by developers. Therefore, some
> of the tests have different results on different computers.
> Every getBytes call on a String object should specify the UTF8 Charset.
> Unfortunately the codebase has nearly 1,800 getBytes calls.