[ 
https://issues.apache.org/jira/browse/ACCUMULO-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488024#comment-13488024
 ] 

Christopher Tubbs commented on ACCUMULO-840:
--------------------------------------------

There are two issues here. The first is establishing a standard encoding for 
all Accumulo internal persistent state/metadata, and the second is how API 
convenience methods that accept String or char[] or CharSequence (from here on, 
I'll refer to these three collectively as "Strings") should automatically 
encode their arguments. I'll deal with the latter first:

API: It is important to note that Accumulo deals only with bytes. That's it. We 
don't guarantee a sort order for Strings with arbitrary (or configurable) 
encoding, though some have asked for custom comparators to achieve fine-grained 
control over this. Instead, we only guarantee a sort order for bytes, sorted 
numerically byte-by-byte, from most significant to least. It is important to 
realize that we only deal with bytes internally, because all of the API 
decisions appear to be centered around that idea. This is why you almost always 
see a Text object, because it holds an arbitrary byte array. It is true that 
Text has a constructor that accepts a String, and it has a very specific 
encoding when it does so (UTF8 only, as per its documentation). We have copied 
this behavior in some of our APIs to add convenience methods that accept 
Strings, because it's easier than forcing users to write {code:java}new 
Mutation(new Text("myString".getBytes("UTF8")));{code} It is so much easier to 
do {code:java}new Mutation("myString");{code}. This does not change the 
behavior, though. We still expect convenience methods that accept Strings to 
behave as though we had converted a String to UTF8 and passed in the resulting 
bytes (in a Text object) to the method.
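
To make the byte-level behavior concrete, here is a minimal sketch that uses 
only the JDK (no Hadoop or Accumulo classes, so it runs on its own). The class 
and helper names are hypothetical, and it assumes that "sorted numerically" 
means comparing bytes as unsigned values, first byte first:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not part of the Accumulo API.
final class ByteOrderSketch {

  // The bytes a String-accepting convenience method is expected to produce.
  static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Compare byte-by-byte as unsigned values, first byte first (Java 9+).
  static int compareRows(byte[] a, byte[] b) {
    return Arrays.compareUnsigned(a, b);
  }

  public static void main(String[] args) {
    // U+00E9 is one char but two bytes in UTF-8 (0xC3 0xA9).
    System.out.println(utf8("\u00e9").length); // prints 2

    // Byte order is not dictionary order: "Z" (0x5A) sorts before "a" (0x61).
    System.out.println(compareRows(utf8("Z"), utf8("a")) < 0); // prints true

    // 0xC3 sorts after 0x61 only because the comparison is unsigned.
    System.out.println(compareRows(utf8("a"), utf8("\u00e9")) < 0); // prints true
  }
}
```

The point of the sketch is that the order users observe is a property of the 
encoded bytes, not of the Strings they passed in.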

API (cont.): Now, it may be the case that the API could benefit from 
convenience wrappers that accept Strings with a specific encoding, or we could 
change the behavior of those we have to respect the JVM's "file.encoding" 
property, and simply pre-encode the Strings before we throw their resulting 
bytes into a Text object. This may be useful and convenient, but this is a VERY 
LIMITED SCOPE, and it's important to realize that any consideration of changes 
to the way we encode things should focus on this scope, and not go crazy, 
changing all instances of "String-based" uses of ".getBytes()" in the code. 
Regardless of whether we make such changes, though, we should update our 
Javadocs to ensure that the encoding we use for these convenience methods is 
described. It is already documented for Mutation; I'm not sure about elsewhere.
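
As a rough sketch of the two options mentioned above, the following shows a 
wrapper that takes an explicit Charset next to one that follows the JVM 
default. Every name here is hypothetical; nothing like this is claimed to exist 
in the Accumulo API. Charset.defaultCharset() is used as a stand-in for 
"respect file.encoding"; how closely the default tracks that property depends 
on the JVM version:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers for pre-encoding Strings before they become bytes.
final class EncodingWrappers {

  // Existing convenience behavior as described above: always UTF-8.
  static byte[] encode(CharSequence s) {
    return encode(s, StandardCharsets.UTF_8);
  }

  // Option 1 (hypothetical): the caller names the encoding explicitly.
  static byte[] encode(CharSequence s, Charset charset) {
    return s.toString().getBytes(charset);
  }

  // Option 2 (hypothetical): follow whatever the JVM default happens to be.
  static byte[] encodeWithJvmDefault(CharSequence s) {
    return encode(s, Charset.defaultCharset());
  }

  public static void main(String[] args) {
    String s = "\u00e9";
    System.out.println(encode(s).length);                              // prints 2
    System.out.println(encode(s, StandardCharsets.ISO_8859_1).length); // prints 1
    // Result depends on the JVM's default charset, so no output is claimed.
    System.out.println(encodeWithJvmDefault(s).length);
  }
}
```

Either way, the bytes are fixed at the moment of the call; everything 
downstream still sees only bytes.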

INTERNAL: The other scope to consider for encoding has to do with our internal 
storage (metadata we store in Zookeeper, in the !METADATA table, and other 
places where Accumulo writes persistent state). It is imperative that we 
maintain consistency in the way we interpret our persistent state. For this 
scope, we absolutely should stick to an encoding, but it should be hard-coded 
(use a Constant or a util method, for convenience), and should not respect any 
user configurable field. This is important, because a user should be able to 
change his/her JVM's encoding settings (for the API scope described above) and 
it should *NOT* affect our ability to read and understand data that we've 
previously written to Zookeeper or !METADATA (or elsewhere).
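
A minimal sketch of what "a Constant or a util method" could look like, 
assuming UTF-8 as the fixed internal encoding. The class, constant, and method 
names are made up for illustration and are not taken from the Accumulo 
codebase:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical utility for reading and writing internal persistent state.
final class InternalEncoding {

  // Hard-coded on purpose: never derived from file.encoding or user config.
  static final Charset INTERNAL_CHARSET = StandardCharsets.UTF_8;

  // Encode a value before writing it to persistent storage.
  static byte[] toBytes(String value) {
    return value.getBytes(INTERNAL_CHARSET);
  }

  // Decode a value that was previously written with toBytes.
  static String fromBytes(byte[] stored) {
    return new String(stored, INTERNAL_CHARSET);
  }

  public static void main(String[] args) {
    String original = "table-\u00e9";
    // The round trip does not consult the JVM default charset at any point.
    System.out.println(fromBytes(toBytes(original)).equals(original)); // prints true
  }
}
```

Because neither method calls the no-argument getBytes() or new String(byte[]), 
changing the JVM's default encoding cannot alter how stored state is read.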

INTERNAL (cont.): For the internal, persistent state's encoding, I'm 
comfortable assuming that we're already treating all persistent String storage 
as UTF-8 encoded (because we move things around in Text objects a lot, and 
where we aren't, we're probably using ASCII, which can safely be treated as 
UTF-8). If there are any situations where we are storing persistent state 
ambiguously, based on anything other than the hard-coded UTF-8 encoding, such 
that a user changing an OS setting could cause a problem, or non-ASCII data 
could find its way in, we should treat that as a bug.
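
The ASCII assumption, and the kind of data that would count as a bug, can be 
checked with a strict decoder. This is only a sketch with a hypothetical helper 
name; it uses the JDK's CharsetDecoder configured to report malformed input 
instead of silently replacing it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical check for whether stored bytes are well-formed UTF-8.
final class Utf8Check {

  static boolean isValidUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException ex) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Pure ASCII bytes are also valid UTF-8 and decode to the same String.
    byte[] ascii = "tserver1".getBytes(StandardCharsets.US_ASCII);
    System.out.println(isValidUtf8(ascii)); // prints true
    System.out.println(
        new String(ascii, StandardCharsets.UTF_8).equals("tserver1")); // prints true

    // U+00E9 written as ISO-8859-1 is the single byte 0xE9: not valid UTF-8.
    byte[] latin1 = "\u00e9".getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(isValidUtf8(latin1)); // prints false
  }
}
```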

As far as I see it, these are the only two scopes we need to concern ourselves 
with when considering encoding.
                
> Allow String-based getBytes calls to pick Charset ending from JVM setting.
> --------------------------------------------------------------------------
>
>                 Key: ACCUMULO-840
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-840
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.5.0
>            Reporter: David Medinets
>            Assignee: David Medinets
>            Priority: Minor
>             Fix For: 1.5.0
>
>
> ACCUMULO-836 changed all String-based getBytes() calls to use the UTF-8 
> standard. However, there is a JVM setting called "jvm.encoding" that should 
> be honored. See 
> http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding
>  for a discussion of JAVA_TOOL_OPTIONS which might be relevant to this topic. 
> http://javarevisited.blogspot.com/2012/01/get-set-default-character-encoding.html
>  is also a good page to read especially the comment on how character encoding 
> is cached.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
