I think there might be memory efficiency issues that should persuade us to adopt Charsets rather than the current approach. I believe (RTFS to be sure) that Charset.decode doesn't deep copy the underlying byte buffer, which is presumably good from a GC standpoint.
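
To make the copy question concrete, here is a minimal Java sketch. It is my own illustration, not Accumulo code: the class and method names are hypothetical, and StandardCharsets assumes Java 7 or later. ByteBuffer.wrap creates a view backed by the existing array rather than copying it, and Charset.decode reads from that view, although it still allocates a new CharBuffer for the decoded characters:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecodeSketch {
    // Hypothetical helper: decode a slice of an existing array with an
    // explicitly named charset, without copying the source bytes first.
    public static String decode(byte[] data, int offset, int length, Charset charset) {
        // wrap() returns a buffer backed by 'data'; no byte copy happens here.
        ByteBuffer view = ByteBuffer.wrap(data, offset, length);
        // decode() reads the remaining bytes of the view and allocates a
        // new CharBuffer for the output characters.
        CharBuffer chars = charset.decode(view);
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] row = "row_0001:cf".getBytes(StandardCharsets.UTF_8);
        // Decode only the first 8 bytes of the array.
        System.out.println(decode(row, 0, 8, StandardCharsets.UTF_8));
    }
}
```

Whether this actually saves anything over new String(bytes, charset) would need measuring, since the output characters are allocated either way.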
Either way, UTF-8 is certainly the most widely used charset in existing deployments, and changing the default to something that is not backwards compatible is probably a bad idea. I'm not familiar with the characteristics of the alternatives, but I strongly believe any across-the-board change needs to be compatible with existing deployments. Perhaps a better approach than a JVM option or forcing one standard would be to create a configuration option.

On Oct 29, 2012 9:22 PM, "Drew Farris" <[email protected]> wrote:

> I have always wondered if there were cases in the API where users are
> forced to use Text when they would otherwise prefer byte[], e.g. stuffing a
> non-UTF-8 byte[] into a Text object to facilitate storage or sorting. Not
> entirely sure whether Text would complain if this were the case. I suspect
> we should seek to eliminate these if they currently exist.
>
> Speaking strictly of user data, I agree that fundamentally, every operation
> should be based upon byte[]. API methods providing Text and String based
> calls should be convenience methods where the conversion of text to/from
> bytes is handled explicitly (not relying on platform default encoding or
> properties) and transparently (doing something sensible when the user
> doesn't care or is unaware of the issues surrounding character encoding).
>
> Regarding UTF-8, is there a need to support arbitrary character encodings
> when persisting bytes to Accumulo? Think byte order for lexical sorting,
> fixed vs. variable length, etc. Perhaps it would not be unreasonable to
> support explicitly stating a character encoding on table creation?
>
> Drew
>
> On Oct 29, 2012 8:47 PM, "Josh Elser" <[email protected]> wrote:
>
> > +1 Mike.
> >
> > 1. It would be hard for me to believe Key/Value are ever handled
> > internally in terms of Strings, but, if such a case does exist, it would
> > be extremely prudent to fix.
> >
> > 2. FWIW, the Shell does use ISO-8859-1 as its charset, which is referenced
> > by other commands [1,2]. It would be good to double check all of the
> > other commands.
> >
> > [1] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/Shell.java
> > [2] https://github.com/apache/accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/InsertCommand.java
> >
> > On 10/29/2012 8:27 PM, Michael Flester wrote:
> >
> >> I agree with Benson entirely with one caveat. It seems to me that there
> >> might be two categories of things being discussed:
> >>
> >> 1. User data (keys and values)
> >> 2. Ancillary things needed for operation of Accumulo (passwords).
> >>
> >> These could well be considered separately. Trying to do anything with
> >> keys and values other than treating them as bytes all of the time
> >> I find quite scary.
> >>
> >> And if this is only being done to satisfy pmd or findbugs, those tools
> >> can be convinced to modify their reporting about this issue.
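
On the quoted points about non-UTF-8 byte[] data and the Shell's use of ISO-8859-1, here is a small Java sketch of why the charset choice matters for compatibility. Again this is only an illustration with hypothetical names, not Accumulo code, and it assumes Java 7 or later for StandardCharsets. ISO-8859-1 maps every byte value to exactly one character, so arbitrary bytes survive a trip through a String, whereas UTF-8 decoding replaces malformed sequences with U+FFFD and the original bytes are lost:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripSketch {
    // Hypothetical helper: true if the bytes are unchanged after being
    // decoded to a String and encoded again with the same charset.
    public static boolean roundTrips(byte[] raw, Charset charset) {
        byte[] back = new String(raw, charset).getBytes(charset);
        return Arrays.equals(raw, back);
    }

    public static void main(String[] args) {
        // Arbitrary binary data that is not well-formed UTF-8:
        // 0xC3 is a lead byte not followed by a continuation byte,
        // and 0xFF never appears in valid UTF-8.
        byte[] raw = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x00};

        System.out.println("ISO-8859-1 round trip: " + roundTrips(raw, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 round trip:      " + roundTrips(raw, StandardCharsets.UTF_8));
    }
}
```

This is one reason a blanket switch of the default charset could silently change stored or displayed bytes for existing deployments, and why a configuration option seems safer to me.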
