[ 
https://issues.apache.org/jira/browse/ACCUMULO-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489228#comment-13489228
 ] 

Josh Elser commented on ACCUMULO-836:
-------------------------------------

*GrepIterator*: The javadoc should either note that the String being converted 
to bytes is encoded as UTF-8, or the code should not make the UTF-8 assertion 
at all. 

*MetadataTable#encode(), DistributedReadWriteLock#getLockData()*: Should note 
that the byte[] returned from the specified method contains UTF-8 bytes.

*LongCombiner.StringEncoder, StringMax, StringMin, StringSummation, 
SummingArrayCombiner.StringArrayEncoder, Authorizations, 
Master#mergeMetadataRecords*: These classes are creating bytes that are UTF-8, 
but when the bytes are initially read into a String (from a Value typically), 
the default encoding is used (String constructor that takes a byte array). This 
leads to inconsistency as the data could have been read as something other than 
UTF-8 but then written back out as UTF-8. A decision needs to be made about 
what to do, and that decision needs to be documented.
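
A minimal sketch of the inconsistency described above, assuming a stored value 
whose bytes were not UTF-8 to begin with. The class and method names are 
hypothetical and are not Accumulo code; the readCharset parameter stands in for 
whatever the platform default happens to be when the bytes are decoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of Accumulo.
public class RoundTripSketch {

  // Decode with one charset (standing in for the platform default),
  // then re-encode as UTF-8, as the classes listed above effectively do.
  static byte[] roundTrip(byte[] stored, Charset readCharset) {
    String decoded = new String(stored, readCharset);
    return decoded.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // The character U+00E9 encoded as ISO-8859-1 is the single byte 0xE9.
    byte[] stored = {(byte) 0xE9};

    // Read as ISO-8859-1, written as UTF-8: the value grows to two bytes.
    byte[] viaLatin1 = roundTrip(stored, StandardCharsets.ISO_8859_1);

    // Read as UTF-8: 0xE9 alone is malformed, so the decoder substitutes
    // a replacement character and the original byte is lost.
    byte[] viaUtf8 = roundTrip(stored, StandardCharsets.UTF_8);

    System.out.println(Arrays.toString(stored));
    System.out.println(Arrays.toString(viaLatin1));
    System.out.println(Arrays.toString(viaUtf8));
  }
}
```

In both cases the bytes written back differ from the bytes that were stored, 
and which result appears depends on the charset used for the read.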

*ZooStore*: Some awkwardness pops out at me in #setProperty(long, String, 
Serializable) manually adding bytes to the data to be written to ZooKeeper. I 
don't think UTF-8 will cause any problems, but it could definitely use some 
clarification.

*TraceServer.Receiver, IndexMeta, AddFilesWithMissingEntries, MetadataTable*: 
These write out a Value as UTF-8 bytes, but I'm not sure whether there is any case 
in which a client reading that data would expect something else. Documentation 
again would be useful. The likelihood of this being an issue is probably small 
considering that Hadoop's WritableUtils encodes Strings as UTF-8.

I'm still a little concerned about access points to ZooKeeper and !METADATA, 
but given that ZooReaderWriter was converting the username and password to 
UTF-8 bytes, I feel slightly better. I should dig into that code more tomorrow.

One final statement: I still believe that in the ambiguous cases where core 
classes read arbitrary bytes and write UTF-8 bytes, Accumulo should be agnostic 
and not make encoding assertions. In other words, I think we should revert 
those changes and leave it up to the user to decide how they handle their bytes.
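
One way to leave the choice to the user, sketched under the assumption that an 
encoder could accept a Charset from the caller instead of hard-coding one. 
CharsetAwareLongEncoder is a hypothetical name for illustration; it is not an 
existing Accumulo class and does not reflect how LongCombiner.StringEncoder is 
implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Accumulo.
public class CharsetAwareLongEncoder {
  private final Charset charset;

  // The caller decides the encoding; the same charset is used for
  // both reading and writing, so the two directions cannot disagree.
  public CharsetAwareLongEncoder(Charset charset) {
    this.charset = charset;
  }

  // Encode the decimal string form of the value with the caller's charset.
  public byte[] encode(long value) {
    return Long.toString(value).getBytes(charset);
  }

  // Decode with the same charset that encode() used.
  public long decode(byte[] bytes) {
    return Long.parseLong(new String(bytes, charset));
  }

  public static void main(String[] args) {
    CharsetAwareLongEncoder encoder =
        new CharsetAwareLongEncoder(StandardCharsets.UTF_16BE);
    byte[] bytes = encoder.encode(42L);
    System.out.println(bytes.length);
    System.out.println(encoder.decode(bytes));
  }
}
```

The point of the sketch is only that reads and writes share one explicitly 
chosen charset; whether that belongs in core classes is the open question.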
                
> Specify Charset on getBytes() call for String objects.
> ------------------------------------------------------
>
>                 Key: ACCUMULO-836
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-836
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.5.0
>            Reporter: David Medinets
>            Assignee: David Medinets
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: UTF8.java
>
>
> The comments on ACCUMULO-241 indicate that the build server might have a 
> different default Charset than computers used by developers. Therefore, some 
> of the tests have different results on different computers.
> Every getBytes call on a String object should specify the UTF8 Charset. 
> Unfortunately the codebase has nearly 1,800 getBytes calls.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
