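Hi Min,

The separators Hive uses by default are ^A, ^B, ^C, ... (ASCII codes 1, 2, 3, etc.). These byte values never appear inside a multi-byte character in either UTF-8 or GBK. Every byte of a multi-byte UTF-8 sequence is 0x80 or higher, and a two-byte GBK character has a lead byte in 0x81-0xFE and a trail byte in 0x40-0xFE. So a byte-wise comparison against the separator cannot match part of an encoded character.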
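If you want to convince yourself, here is a small sketch in Python (not Hive code) that encodes every character each codec accepts and looks for bytes 1, 2 or 3 inside multi-byte sequences. The names multibyte_collisions and split_fields are made up for this illustration, and the check relies on Python's built-in "utf-8" and "gbk" codecs rather than on anything in Hive itself.

```python
# Hive's default delimiters as described above: ^A, ^B, ^C (bytes 1, 2, 3).
SEPARATORS = {0x01, 0x02, 0x03}


def multibyte_collisions(encoding):
    """Return the characters whose multi-byte encoding contains a separator byte."""
    hits = []
    for cp in range(0x110000):
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # character (or surrogate) not representable in this codec
        if len(encoded) > 1 and SEPARATORS.intersection(encoded):
            hits.append(chr(cp))
    return hits


def split_fields(record, encoding, separator=b"\x01"):
    """Split a raw record on the separator byte, then decode each field."""
    return [field.decode(encoding) for field in record.split(separator)]


if __name__ == "__main__":
    for enc in ("utf-8", "gbk"):
        print(enc, "collisions:", len(multibyte_collisions(enc)))
    raw = "\x01".join(["中文", "数据", "abc"]).encode("gbk")
    print(split_fields(raw, "gbk"))
```

Splitting on the raw byte before decoding mirrors the byte comparison you pointed at, which is why it stays safe for both encodings.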
Please see these code maps for details:
http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/GBK

Zheng

On Tue, Jul 7, 2009 at 11:59 PM, Min Zhou<[email protected]> wrote:
> Hi all,
> It seems that Hive would go wrong when storing Unicode strings. Hive uses
> byte comparison for delimiting the fields of a record
> (see LazyStruct.java:92, a parse method).
> If we use GBK or UTF-8 encoding, where a character may need more than 1
> byte, perhaps 2-3 bytes, then the separator used for delimiting fields
> could by coincidence equal one of the bytes of a GBK/UTF-8 encoded
> character, and things would go wrong.
> Can Hive solve the problem above?
>
> Thanks,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>

--
Yours,
Zheng
