On Mon, May 6, 2013 at 2:49 PM, Josh Elser <[email protected]> wrote:
> Would a better long-term solution be to just deal with it in a new shell
> that actually supports all sorts of constructs outside of the current shell
> commands?
>
> I'm thinking of Python where you have the ability to specify things like
> u'\0000'. The proxy would certainly drop the barrier of doing something
> like this.
>
> Would that be overkill to work towards in 1.6? Does this merit fixing
> sooner?

There is ACCUMULO-1045.

> On 5/6/13 2:09 PM, Keith Turner wrote:
>
>> On Sun, May 5, 2013 at 6:49 PM, Drew Farris <[email protected]> wrote:
>>
>>> In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
>>> getEndRow use the following snippet to read their arguments:
>>>
>>> new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
>>>
>>> Here, Shell.CHARSET is set to ISO-8859-1.
>>>
>>> This seems to mean that if I use UTF-8 characters (unescaped) from the
>>> shell to set my begin or end row, I will not get what I expect, because
>>> the conversion from String to bytes would be performed using the
>>> incorrect character set.
>>>
>>> For example, in the following snippet, testISO fails while testUTF
>>> succeeds (when the encoding of the source file is UTF-8):
>>>
>>> @Test
>>> public void testISO() throws Exception {
>>>   String s = "本条目是介紹";
>>>   String charset = "ISO-8859-1";
>>>   Text t = new Text(s.getBytes(charset));
>>>   Assert.assertEquals(s, t.toString());
>>> }
>>>
>>> @Test
>>> public void testUTF() throws Exception {
>>>   String s = "本条目是介紹";
>>>   String charset = "UTF-8";
>>>   Text t = new Text(s.getBytes(charset));
>>>   Assert.assertEquals(s, t.toString());
>>> }
>>>
>>> Possibly this should be locale dependent behavior? Also, perhaps I'm
>>> missing the fact that the Shell is not supposed to support UTF-8
>>> characters in start and end ranges, and users must escape their strings
>>> appropriately. (Which would be a bit of a pain).
>>
>> I think the way the shell is written, it pushes binary data (that may not
>> be UTF-8) through strings. I think at some point the \xNN escape codes are
>> converted to binary and this data is pushed back into a String.
>> ISO-8859-1 ensures this works. Ideally the shell would not do this.
>>
>>> - Drew
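
To make the trade-off described above concrete, here is a minimal sketch using only the JDK. It is an illustration, not the shell's actual code: the class CharsetRoundTrip and its helper methods are hypothetical names, and it uses plain byte arrays instead of Hadoop's Text. It shows that arbitrary bytes survive a trip through a String when the charset is ISO-8859-1 but not when it is UTF-8, and that characters outside ISO-8859-1 are replaced with '?' when encoded with it, which is consistent with testISO failing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {

    // Decode raw bytes into a String and encode them back using ISO-8859-1.
    // Every byte value 0x00-0xFF maps to exactly one char, so nothing is lost.
    static byte[] viaLatin1(byte[] raw) {
        String s = new String(raw, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Same round trip using UTF-8. Byte sequences that are not valid UTF-8
    // are decoded to U+FFFD, so the original bytes cannot be recovered.
    static byte[] viaUtf8(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encode a String with ISO-8859-1, as the quoted OptUtil snippet does.
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Binary row data, e.g. what \x00\xff\x80A could expand to.
        byte[] binary = {(byte) 0x00, (byte) 0xFF, (byte) 0x80, (byte) 0x41};

        System.out.println(Arrays.equals(binary, viaLatin1(binary))); // true
        System.out.println(Arrays.equals(binary, viaUtf8(binary)));   // false

        // The first three characters of the test string, written as escapes
        // so the result does not depend on the source file encoding.
        String row = "\u672c\u6761\u76ee";
        // None of these characters exist in ISO-8859-1; each becomes '?' (63).
        System.out.println(Arrays.toString(encodeLatin1(row))); // [63, 63, 63]
    }
}
```

So a single charset cannot serve both purposes: ISO-8859-1 keeps escaped binary data intact but cannot represent unescaped non-Latin input, while UTF-8 does the reverse. That seems to be the argument for not pushing binary data through a String in the first place.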
