Re: unicode supporting in hive

Zheng Shao Wed, 08 Jul 2009 00:03:38 -0700

Hi Min,

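The default separators in Hive are ^A, ^B, ^C, ... (ASCII codes
1, 2, 3, etc.).
These byte values never appear inside a multi-byte character in either
UTF-8 or GBK. In UTF-8 every byte of a multi-byte character is 0x80 or
higher, and in GBK the lead byte is 0x81-0xFE and the trail byte is
0x40-0xFE.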


Please see these code maps for details:
http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/GBK
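
As a rough illustration of the point above (this is not Hive's code, and
the names DELIMITERS, find_collisions and split_fields are made up for the
example), the Python sketch below encodes every non-ASCII code point in the
Basic Multilingual Plane as UTF-8 and as GBK and looks for the bytes 1, 2
and 3. It then splits a ^A-delimited row at the byte level, which is an
assumed simplification of what a byte-comparing parser does.

```python
# Sketch only: checks that the default delimiter bytes (0x01, 0x02, 0x03)
# never occur inside the UTF-8 or GBK encoding of a non-ASCII character.
DELIMITERS = {0x01, 0x02, 0x03}  # ^A, ^B, ^C


def find_collisions(encoding, delimiters=DELIMITERS):
    """Return the BMP code points above 0x7F whose encoded bytes
    contain one of the delimiter bytes."""
    collisions = []
    for cp in range(0x80, 0x10000):
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            # Surrogates, or characters this encoding cannot represent.
            continue
        if any(b in delimiters for b in encoded):
            collisions.append(cp)
    return collisions


def split_fields(row_bytes, encoding, delimiter=b"\x01"):
    """Split a row on the delimiter byte first, then decode each field."""
    return [field.decode(encoding) for field in row_bytes.split(delimiter)]


if __name__ == "__main__":
    for enc in ("utf-8", "gbk"):
        print(enc, "collisions:", len(find_collisions(enc)))
        row = "\x01".join(["张三", "北京", "42"]).encode(enc)
        print(enc, "fields:", split_fields(row, enc))
```

A delimiter taken from the printable ASCII range would not be safe in the
same way for GBK, since GBK trail bytes can fall in 0x40-0x7E.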


Zheng

On Tue, Jul 7, 2009 at 11:59 PM, Min Zhou<[email protected]> wrote:
> Hi all,
> It seems that Hive goes wrong when storing Unicode strings. Hive uses
> byte comparison to delimit the fields of a record (see
> LazyStruct.java:92, a parse method).
> If we use GBK or UTF-8 encoding, where a character may need more than 1
> byte (possibly 2-3 bytes), then the separator used for delimiting fields
> could by coincidence equal one of the bytes of a GBK/UTF-8 encoded
> character, and things go wrong.
> Can Hive solve the problem above?
>
> Thanks,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>



-- 
Yours,
Zheng
