Accumulo may not be just a set of servers, but it is designed to be a set of processes, each of which has its own JVM. I think this mostly boils down to an API issue, however: if Accumulo deals with users' data in terms of bytes, then the encoding question is put back on the user, which I'm fine with as a trade-off between configuration and convention.
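To make the bytes point concrete, here is a rough sketch. It is not Accumulo code: `BytesApiSketch`, `put`, `get`, and `roundTrip` are names I made up for illustration, and the "store" is just a field. The only thing it tries to show is that when the API sees nothing but `byte[]`, the caller picks the charset explicitly on both sides, and the JVM's default encoding never enters into it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a store whose API only ever sees bytes.
final class BytesApiSketch {
    private static byte[] stored = new byte[0];

    // The store copies the bytes; it neither knows nor cares how they were encoded.
    static void put(byte[] value) {
        stored = Arrays.copyOf(value, value.length);
    }

    static byte[] get() {
        return Arrays.copyOf(stored, stored.length);
    }

    // Caller-side round trip: the charset is named explicitly for both
    // encoding and decoding, so the platform default is never consulted.
    static String roundTrip(String text, Charset charset) {
        put(text.getBytes(charset));
        return new String(get(), charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";

        // Same charset on both sides: the string survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));

        // Mismatched charsets: the caller's mistake, not the store's.
        put(text.getBytes(StandardCharsets.UTF_8));
        String wrong = new String(get(), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(text));
    }
}
```

If the caller encodes with one charset and decodes with another, the data comes back mangled, but that is the caller's responsibility under this kind of API, which is the trade-off I'm describing.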
There are other cases beyond simply a client API, though, namely configuration. I'm more comfortable with enforcing some standard there.

On Tue, Oct 30, 2012 at 8:31 PM, Benson Margulies <[email protected]> wrote:

> On Tue, Oct 30, 2012 at 8:21 PM, Josh Elser <[email protected]> wrote:
> > On 10/30/2012 7:47 PM, David Medinets wrote:
> >>>
> >>> My issue with this is that you have now hard-coded the fact that everyone
> >>> else is going to use UTF-8.
> >>
> >>
> >> Who is everyone else? I agree that I have hard-coded the use of UTF-8.
> >> On the other hand, I've merely codified an existing practice. Thus the
> >> issue is now exposed, the places the convention is used are defined.
> >> Once a consensus is reached, we can implement it with confidence.
> >
> >
> > "Everyone else" is everyone who builds Accumulo since you committed your
> > changes and uses it. Ignoring that, forcing a single charset isn't the big
> > issue here (as we've *all* agreed that UTF-8 should not cause any
> > data-correctness issues) so for now I'll just drop it as it's just creating
> > confusion.
> >
> > My issue is *how* you implemented the default charset. We already have 3
> > people (Marc, Bill and myself) who have stated that we believe inline
> > charset declaration is not the correct implementation and that using the JVM
> > property is the better implementation.
> >
> > I'd encourage others to weigh in to form a complete consensus and shift the
> > discussion to that implementation if needed.
> >
> >>
> >>> way to fix the problem. I still contest that setting the desired encoding
> >>> (via the appropriate JVM property like Bill Slacum initial suggested) is
> >>> the proper way to address the issue.
> >>
> >>
> >> It is easy to do both. Create a ByteEncodingInitializer (or somesuch)
> >> class that reads the JVM property and defines a globally used Charset.
> >> The find those utf8 definitions and usages and replace them with the
> >> globally-defined value.
> >
> >
> > Again, by setting the 'file.encoding' JVM parameter, such a class is
> > unnecessary because it should be handled internal to Java. For Oracle/Sun
> > JDK and OpenJDK, setting the "file.encoding" parameter at run time will use
> > the provided charset you wanted without actually changing any code.
>
> If Accumulo was only a pile of servers, you could do this. You could
> say that part of the configuration process for the servers is to
> specify the desired encoding to file.encoding, and your shell scripts
> could set UTF-8 by default.
>
> But Accumulo is *not* just a pile of servers. Setting file.encoding
> effects the entire JVM. A webapp that uses Accumulo now would need to
> have the entire servlet container have a particular setting of
> file.encoding. This just does not work in the wild. Even without the
> servlet container issue, a user of Accumulo may need to plug it into
> an existing code base that has other reasons to set file.encoding, and
> will not like it when Accumulo starts to corrupt his or her string
> data.
>
