Re: unicode supporting in hive

Zheng Shao Wed, 08 Jul 2009 00:04:57 -0700

However, UTF-8 is hard-coded in many places in Hive (and in Hadoop as
well; see Text.java).

If you want to use a different encoding such as GBK, we will probably
need to factor that hard-coded UTF-8 assumption out of all of that code.
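
A minimal sketch of both points in this thread, using Python's built-in
"utf-8" and "gbk" codecs rather than Hive's own Java code. The function
names and sample strings are illustrative assumptions, not anything from
Hive or Hadoop. It checks that the default separator bytes (1, 2, 3) never
occur in the encoded form of a non-ASCII character, that splitting a record
byte-wise on ^A therefore round-trips in either encoding, and that GBK bytes
are not valid input for a decoder that assumes UTF-8.

```python
# Default Hive separators as raw bytes: ^A, ^B, ^C.
SEPARATORS = (b"\x01", b"\x02", b"\x03")


def chars_hitting_separator(encoding):
    """Return the non-ASCII BMP characters whose encoded bytes contain a separator."""
    hits = []
    for code_point in range(0x80, 0x10000):
        try:
            data = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            # Not representable in this encoding (or a surrogate); skip it.
            continue
        if any(sep in data for sep in SEPARATORS):
            hits.append(chr(code_point))
    return hits


def split_fields(record, sep=b"\x01"):
    """Split a raw record on one separator byte, as a byte-wise parser would."""
    return record.split(sep)


def round_trip(fields, encoding):
    """Encode fields, join them with ^A, split byte-wise, and decode again."""
    record = b"\x01".join(f.encode(encoding) for f in fields)
    return [part.decode(encoding) for part in split_fields(record)]


def decodes_as_utf8(data):
    """Return True if the bytes are valid UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample_fields = ["id42", "中文数据", "末尾"]  # hypothetical record contents

for enc in ("utf-8", "gbk"):
    # No multi-byte character collides with a separator byte.
    assert chars_hitting_separator(enc) == []
    # Byte-wise splitting recovers the original fields.
    assert round_trip(sample_fields, enc) == sample_fields

# GBK-encoded text is not valid UTF-8, so code that assumes UTF-8 cannot
# read it correctly without changes.
assert not decodes_as_utf8("中文数据".encode("gbk"))
```

Python's strict decoder raises an error on the GBK bytes; a decoder that
substitutes replacement characters instead would return garbled text rather
than fail, which is the same underlying problem.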

Zheng

On Wed, Jul 8, 2009 at 12:03 AM, Zheng Shao<[email protected]> wrote:
> Hi Min,
>
> By default, Hive uses ^A, ^B, ^C, ... as separators (ASCII codes
> 1, 2, 3, etc.).
> These byte values never occur inside a multi-byte character in either
> UTF-8 or GBK.
>
> Please see these code maps for details:
> http://en.wikipedia.org/wiki/UTF-8
> http://en.wikipedia.org/wiki/GBK
>
>
> Zheng
>
> On Tue, Jul 7, 2009 at 11:59 PM, Min Zhou<[email protected]> wrote:
>> Hi all,
>> It seems that Hive can go wrong when storing Unicode strings. Hive uses
>> byte comparison to delimit the fields of a record (see
>> LazyStruct.java:92, a parse method).
>> If we use GBK or UTF-8 encoding, where a character may need more than 1
>> byte (possibly 2-3 bytes), then the field separator could by coincidence
>> equal one of the bytes of a GBK/UTF-8 encoded character,
>> and things go wrong.
>> Can Hive solve the problem above?
>>
>> Thanks,
>> Min
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
>>
>
>
>
> --
> Yours,
> Zheng
>



-- 
Yours,
Zheng
