Re: Shell Charset?

Keith Turner Mon, 06 May 2013 11:10:01 -0700

On Sun, May 5, 2013 at 6:49 PM, Drew Farris <[email protected]> wrote:

> In o.a.a.core.util.shell.commands.OptUtil, I notice that getStartRow and
> getEndRow use the following snippet to read their arguments:
>
> new Text(cl.getOptionValue(END_ROW_OPT).getBytes(Shell.CHARSET));
>
> Here, Shell.CHARSET is set to ISO-8859-1.
>
> This seems to mean that if I use UTF-8 characters (unescaped) from the
> shell to set my begin or end row, I will not get what I expect, because
> the conversion from String to bytes would be performed using the incorrect
> character set.
>
> For example, in the following snippet, testISO fails while testUTF succeeds
> (when the encoding of the source file is UTF-8):
>
>   @Test
>   public void testISO() throws Exception {
>     String s = "本条目是介紹";
>     String charset = "ISO-8859-1";
>     Text t = new Text(s.getBytes(charset));
>     Assert.assertEquals(s, t.toString());
>   }
>
>   @Test
>   public void testUTF() throws Exception {
>     String s = "本条目是介紹";
>     String charset = "UTF-8";
>     Text t = new Text(s.getBytes(charset));
>     Assert.assertEquals(s, t.toString());
>   }
>
> Possibly this should be locale-dependent behavior? Also, perhaps I'm
> missing the fact that the Shell is not supposed to support UTF-8 characters
> in start and end ranges, and users must escape their strings appropriately.
> (Which would be a bit of a pain).
>
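
A minimal sketch of why the ISO-8859-1 variant of the test fails, using only
the JDK. It assumes that decoding the bytes as UTF-8 with a plain String is a
fair stand-in for Text.toString(); the class and method names are hypothetical.
None of the six characters exists in ISO-8859-1, so String.getBytes replaces
each one with '?' before the bytes ever reach Text.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Encode {
  // Encode with ISO-8859-1, then decode as UTF-8 (standing in for Text.toString()).
  static String viaLatin1(String s) {
    byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Encode and decode with UTF-8 on both sides.
  static String viaUtf8(String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // The same six characters as in the tests, written as Unicode escapes
    // so that the source file encoding does not matter.
    String s = "\u672c\u6761\u76ee\u662f\u4ecb\u7d39";

    System.out.println(viaLatin1(s));           // prints "??????"
    System.out.println(viaLatin1(s).equals(s)); // prints "false"
    System.out.println(viaUtf8(s).equals(s));   // prints "true"
  }
}
```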


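I think the way the shell is written, it pushes binary data (that may not
be UTF-8) through Strings.  I think at some point the \xNN escape codes are
converted to binary and this data is pushed back into a String.
ISO-8859-1 ensures this works, because it maps every byte value 0x00-0xFF to
the character with the same code point, so arbitrary bytes survive the trip
through a String unchanged.  Ideally the shell would not do this.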
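
To illustrate the round trip described above, here is a rough sketch using
only the JDK. It is not the shell's actual code: the class name, the roundTrip
helper, and the sample bytes are all hypothetical. It pushes a few arbitrary
bytes through a String and back, once with ISO-8859-1 and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
  // Decode raw bytes into a String, then encode that String back to bytes
  // using the same charset.
  static byte[] roundTrip(byte[] raw, Charset cs) {
    return new String(raw, cs).getBytes(cs);
  }

  public static void main(String[] args) {
    // Hypothetical binary row, e.g. what unescaping "\xff\x00\x80a" might yield.
    byte[] raw = {(byte) 0xff, 0x00, (byte) 0x80, 'a'};

    // ISO-8859-1 maps each byte to exactly one char and back: lossless.
    System.out.println(
        Arrays.equals(raw, roundTrip(raw, StandardCharsets.ISO_8859_1))); // prints "true"

    // 0xff and 0x80 are not valid UTF-8 on their own, so decoding replaces
    // them with U+FFFD and the original bytes are lost.
    System.out.println(
        Arrays.equals(raw, roundTrip(raw, StandardCharsets.UTF_8)));      // prints "false"
  }
}
```

The trade-off is the one raised in the question: the same charset that keeps
binary data intact cannot represent unescaped characters outside Latin-1.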


>
>
> - Drew
>
