However, UTF-8 is hard-coded in a lot of places in Hive (and actually in Hadoop as well; see Text.java).
If you want to use a different encoding such as GBK, we will probably
need to factor that hard-coded UTF-8 out of all the code.

Zheng

On Wed, Jul 8, 2009 at 12:03 AM, Zheng Shao wrote:
> Hi Min,
>
> The separators used in Hive are by default ^A, ^B, ^C ... (ASCII codes
> 1, 2, 3, etc.).
> These bytes won't appear inside a multi-byte character in either UTF-8
> or GBK.
>
> Please see these code maps for details:
> http://en.wikipedia.org/wiki/UTF-8
> http://en.wikipedia.org/wiki/GBK
>
> Zheng
>
> On Tue, Jul 7, 2009 at 11:59 PM, Min Zhou wrote:
>> Hi all,
>> It seems that Hive would go wrong when storing Unicode strings. Hive uses
>> byte comparison for delimiting the fields of a record
>> (see LazyStruct.java:92, a parse method).
>> If we use GBK or UTF-8 encoding, where a character may need more than 1
>> byte (possibly 2-3 bytes), then the separator used for delimiting fields
>> could by coincidence equal one of the bytes of a GBK/UTF-8 encoded
>> character, and things would go wrong.
>> Can Hive solve the problem above?
>>
>> Thanks,
>> Min
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
>
> --
> Yours,
> Zheng

--
Yours,
Zheng
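
P.S. For anyone who wants to check the separator claim themselves, here is a minimal sketch. It is not Hive code: the class name SeparatorCheck and its two helper methods are made up for illustration, and it assumes the JRE ships a "GBK" charset (most full JDK builds do, but the sketch skips it if not). It encodes a two-character Chinese string in both encodings, confirms that no encoded byte equals ^A (ASCII code 1), and then splits a row on ^A by plain byte comparison, which is roughly the approach described for the LazyStruct parse method.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class SeparatorCheck {
    // Hive's default field separator, ^A (ASCII code 1).
    static final byte FIELD_SEPARATOR = 1;

    // Returns true if any byte in data equals sep.
    static boolean containsByte(byte[] data, byte sep) {
        for (byte b : data) {
            if (b == sep) {
                return true;
            }
        }
        return false;
    }

    // Splits a raw row on sep using byte comparison only,
    // then decodes each field with the given charset.
    static List<String> splitFields(byte[] row, byte sep, Charset charset) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= row.length; i++) {
            if (i == row.length || row[i] == sep) {
                fields.add(new String(row, start, i - start, charset));
                start = i + 1;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is a two-character Chinese string, written with
        // escapes so the source file itself stays plain ASCII.
        String text = "\u4e2d\u6587";
        for (String name : new String[] {"UTF-8", "GBK"}) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + " is not available in this JRE, skipping");
                continue;
            }
            Charset cs = Charset.forName(name);
            byte[] encoded = text.getBytes(cs);
            System.out.println(name + " bytes contain ^A: "
                    + containsByte(encoded, FIELD_SEPARATOR));

            // Build a two-field row: <text> ^A "hive", then split it again.
            byte[] row = (text + "\u0001" + "hive").getBytes(cs);
            System.out.println(name + " fields: "
                    + splitFields(row, FIELD_SEPARATOR, cs));
        }
    }
}
```

The reason this works is that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, and GBK lead bytes are 0x81 or higher with trail bytes 0x40 or higher, so the control codes 1, 2, 3 can only show up as themselves. Note that this would stop being true if a table were declared with a custom delimiter in the 0x40-0xFE range while the data is GBK.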
