[ 
https://issues.apache.org/jira/browse/HBASE-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707149#comment-13707149
 ] 

Nick Dimiduk commented on HBASE-8865:
-------------------------------------

After reading through this issue more carefully and also looking at HBASE-6643, 
I think users of the shell would expect all commands that interact with 
byte[]'s to be processed through {{Bytes.toBytesBinary}} and results printed 
using {{Bytes.toStringBinary}}. The trouble with {{toBytesBinary}} is that it 
doesn't take the extra step of performing UTF-8 encoding on non-escaped 
characters.

{code}
      } else {
        b[size++] = (byte) ch;
      }
{code}

That cast to {{byte}} from {{ch}} should instead be the equivalent of:

{noformat}
String.valueOf(ch).getBytes("UTF-8");
{noformat}

I think using this patch as is will break splits for split points containing 
non-escaped Unicode characters (e.g., ü), because each one is cast to a single 
{{byte}}.
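
To make the difference concrete, here is a minimal standalone sketch that compares the cast with UTF-8 encoding for the character ü (U+00FC). It does not call HBase; {{castEncode}}, {{utf8Encode}}, and {{hex}} are hypothetical helpers written only for this illustration, and {{castEncode}} merely imitates the non-escaped branch quoted above.

```java
import java.nio.charset.StandardCharsets;

class CastVsUtf8 {
    // Imitates the non-escaped branch: one byte per char, upper 8 bits dropped.
    static byte[] castEncode(String s) {
        byte[] b = new byte[s.length()];
        int size = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            b[size++] = (byte) ch;
        }
        return b;
    }

    // Encodes the whole string as UTF-8, so non-ASCII chars become multi-byte.
    static byte[] utf8Encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Renders bytes in the \xNN style used by the shell output.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte x : bytes) {
            sb.append(String.format("\\x%02X", x & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00FC"; // the character ü
        System.out.println(hex(castEncode(s))); // prints \xFC
        System.out.println(hex(utf8Encode(s))); // prints \xC3\xBC
    }
}
```

With the cast, ü becomes the single byte \xFC, whereas a UTF-8 encoded row key containing ü would hold \xC3\xBC, so the two split points would not match.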
                
> HBase shell split command acts incorrectly with hex split keys.
> ---------------------------------------------------------------
>
>                 Key: HBASE-8865
>                 URL: https://issues.apache.org/jira/browse/HBASE-8865
>             Project: HBase
>          Issue Type: Bug
>          Components: shell, Usability
>    Affects Versions: 0.94.8
>         Environment: Linux
>            Reporter: Ding Haifeng
>         Attachments: 8865.txt
>
>
> When I tried to do a manual region split from the HBase shell, I found that the 
> split command acts incorrectly with hex split keys. 
> Here is an example.
> I execute hbase(main):003:0> split 'tsdb', "\x00\x00\xC3" .
> While I expect it to split at the 3-byte key "\x00\x00\xC3", it actually 
> splits at the 5-byte key "\x00\x00\xEF\xBF\xBD". 
> I tested with more split keys and found some patterns:
> * If all the bytes in the split key, represented in hexadecimal, are between 
> "\x00" and "\x7F", it works as expected and splits at exactly the key 
> specified.
> * If there are any bytes between "\x80" and "\xFF", it works incorrectly. No 
> matter what the byte is, it is interpreted as "\xEF\xBF\xBD". Here is another 
> example. Specifying split key "\x00\xA0\x00\xB0" actually splits at 
> "\x00\xEF\xBF\xBD\x00\xEF\xBF\xBD".
> I'm running HBase 0.94.8, r1485407, both server-side and client-side. 
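
The pattern in the report is consistent with the raw key bytes being decoded as UTF-8 text and then re-encoded somewhere along the way. That is an assumption about the mechanism, not something the report confirms. The sketch below only demonstrates the general Java behaviour: malformed UTF-8 input decodes to U+FFFD, which re-encodes as \xEF\xBF\xBD. {{roundTrip}} and {{hex}} are hypothetical helpers, not HBase or shell code.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Decodes bytes as UTF-8 text, then encodes the text back to UTF-8.
    // Malformed input is replaced with U+FFFD during decoding.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Renders bytes in the \xNN style used in the report.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte x : bytes) {
            sb.append(String.format("\\x%02X", x & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] first = {0x00, 0x00, (byte) 0xC3};
        byte[] second = {0x00, (byte) 0xA0, 0x00, (byte) 0xB0};
        System.out.println(hex(roundTrip(first)));  // prints \x00\x00\xEF\xBF\xBD
        System.out.println(hex(roundTrip(second))); // prints \x00\xEF\xBF\xBD\x00\xEF\xBF\xBD
    }
}
```

Both outputs match the split keys observed in the report, and bytes in the \x00 to \x7F range pass through such a round trip unchanged.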
