Unfortunately, I have exceeded my time budget for working on an issue of this complexity without a concise and clear design document.
I may be able to help later if a concise and clear design document becomes available.

On Tue, Aug 2, 2016 at 8:41 PM, Yuanhao Luo <[email protected]> wrote:
>
> Hello, Jim Apple.
>
> For now, in my commit, field terminators cannot be set to extended ASCII
> characters.
> After creating the table with the statement "create table
> text_thorn_ecirc_newline(col1 string, col2 string, col3 int, col4 int)
> row format delimited fields terminated by 'þ' escaped by '-22' lines
> terminated by '\n';", the result of "describe extended
> text_thorn_ecirc_newline" is:
>
> [nobida147:21000] > describe extended text_thorn_ecirc_newline;
> Query: describe extended text_thorn_ecirc_newline
> Query submitted at: 2016-08-03 10:57:11 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at:
> http://0.0.0.0:25000/query_plan?query_id=fd4bf0a9154be6a7:b3f6dcabe9dea3ba
> +------------------------------+------------------------------------------------------------------------------------+----------------------+
> | name                         | type                                                                               | comment              |
> +------------------------------+------------------------------------------------------------------------------------+----------------------+
> | # col_name                   | data_type                                                                          | comment              |
> |                              | NULL                                                                               | NULL                 |
> | col1                         | string                                                                             | NULL                 |
> | col2                         | string                                                                             | NULL                 |
> | col3                         | int                                                                                | NULL                 |
> | col4                         | int                                                                                | NULL                 |
> |                              | NULL                                                                               | NULL                 |
> | # Detailed Table Information | NULL                                                                               | NULL                 |
> | Database:                    | multi_byte_test2                                                                   | NULL                 |
> | Owner:                       | root                                                                               | NULL                 |
> | CreateTime:                  | Wed Aug 03 10:55:25 CST 2016                                                       | NULL                 |
> | LastAccessTime:              | UNKNOWN                                                                            | NULL                 |
> | Protect Mode:                | None                                                                               | NULL                 |
> | Retention:                   | 0                                                                                  | NULL                 |
> | Location:                    | hdfs://localhost:20500/test-warehouse/multi_byte_test2.db/text_thorn_ecirc_newline | NULL                 |
> | Table Type:                  | MANAGED_TABLE                                                                      | NULL                 |
> | Table Parameters:            | NULL                                                                               | NULL                 |
> |                              | transient_lastDdlTime                                                              | 1470192925           |
> |                              | NULL                                                                               | NULL                 |
> | # Storage Information        | NULL                                                                               | NULL                 |
> | SerDe Library:               | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                 | NULL                 |
> | InputFormat:                 | org.apache.hadoop.mapred.TextInputFormat                                           | NULL                 |
> | OutputFormat:                | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                         | NULL                 |
> | Compressed:                  | No                                                                                 | NULL                 |
> | Num Buckets:                 | 0                                                                                  | NULL                 |
> | Bucket Columns:              | []                                                                                 | NULL                 |
> | Sort Columns:                | []                                                                                 | NULL                 |
> | Storage Desc Params:         | NULL                                                                               | NULL                 |
> |                              | escape.delim                                                                       | -22                  |
> |                              | field.delim                                                                        | \u00FE               |
> |                              | line.delim                                                                         | \n                   |
> |                              | serialization.format                                                               | \u00FE               |
> +------------------------------+------------------------------------------------------------------------------------+----------------------+
> Fetched 32 row(s) in 0.10s
>
> We can see that the field delimiter is correctly parsed as the extended
> ASCII character with decimal value 254 (see the last three lines of the
> log above). However, when running the query "select * from
> text_thorn_ecirc_newline", the result is:
>
> [nobida147:21000] > select * from text_thorn_ecirc_newline;
> Query: select * from text_thorn_ecirc_newline
> Query submitted at: 2016-08-03 11:01:01 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at:
> http://0.0.0.0:25000/query_plan?query_id=fd494eef6abac951:972191dbc5e1bd94
> +------------------+------+------+------+
> | col1             | col2 | col3 | col4 |
> +------------------+------+------+------+
> | one�two�3�4      | NULL | NULL | NULL |
> | one�one�two�3�4  | NULL | NULL | NULL |
> | one��two�3�4     | NULL | NULL | NULL |
> | one��one�two�3�4 | NULL | NULL | NULL |
> | one���two�3�4    | NULL | NULL | NULL |
> +------------------+------+------+------+
> Fetched 5 row(s) in 0.44s
>
> After debugging, I found that the value of field_delim_.size() is 2 (we
> expect 1) at
> https://gerrit.cloudera.org/#/c/3314/7/be/src/exec/delimited-text-parser.cc@126.
> The member function size() returns the number of bytes in the string, as
> http://www.cplusplus.com/reference/string/string/size/ describes.
> If we can get the "correct size" of the string here, I believe we could
> use extended ASCII characters as well.
>
> As for the tests of the two corner cases, I have posted my logs in
> IMPALA-3945.
>
>
> ------------------ Original Message ------------------
> From: "jbapple" <[email protected]>
> Sent: Wednesday, August 3, 2016, 6:52 AM
> To: "Yuanhao Luo" <[email protected]>
> Cc: "dev@impala" <[email protected]>
> Subject: Re: Re: Re: IMPALA-2428 Support multiple-character string as the
> field delimiter
>
> Also, you asked
>
>> I'm wondering whether you have ever tested these two cases.
>
> I do not know. Can you check and report back what you find?
>
> On Tue, Aug 2, 2016 at 3:49 PM, Jim Apple <[email protected]> wrote:
>>> What's more, in this patch, we can use only standard ASCII characters
>>> (with decimal values from 0 to 127) in ASCII or octal format to set the
>>> field terminator, but not extended ASCII characters (with decimal
>>> values from 128 to 255) or standard ASCII characters in Unicode,
>>> decimal, or hexadecimal format.
>>
>> Today, can field terminators be set to extended ASCII characters (not
>> with octal, but raw)?
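
A note on the field_delim_.size() observation quoted above. The thread does not confirm how the delimiter is encoded when it reaches the backend, but a size of 2 is what one would see if 'þ' (U+00FE) arrived as UTF-8, where it is the two bytes 0xC3 0xBE. Below is a minimal, standalone C++ sketch of that effect. CountUtf8CodePoints and Utf8ToSingleByte are hypothetical helpers written for this illustration; they are not part of Impala or of the patch under review.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts UTF-8 code points by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes valid UTF-8; does not validate.
std::size_t CountUtf8CodePoints(const std::string& s) {
  std::size_t count = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Hypothetical helper (not part of Impala): if `s` holds exactly one code
// point in U+0000..U+00FF, stores its single-byte value in *out and returns
// true; otherwise returns false and leaves *out untouched.
bool Utf8ToSingleByte(const std::string& s, unsigned char* out) {
  assert(out != nullptr);
  if (s.size() == 1) {
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    if (b0 < 0x80) {  // plain 7-bit ASCII
      *out = b0;
      return true;
    }
    return false;
  }
  if (s.size() == 2) {
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    const unsigned char b1 = static_cast<unsigned char>(s[1]);
    // 0xC2 and 0xC3 are the only lead bytes that encode U+0080..U+00FF.
    if ((b0 == 0xC2 || b0 == 0xC3) && (b1 & 0xC0) == 0x80) {
      *out = static_cast<unsigned char>(((b0 & 0x03) << 6) | (b1 & 0x3F));
      return true;
    }
  }
  return false;
}
```

With the delimiter written as std::string("\xC3\xBE"), size() is 2, CountUtf8CodePoints returns 1, and Utf8ToSingleByte yields 0xFE (decimal 254). Whether the parser should count code points or map the delimiter back to a single byte depends on how the data files encode it, which is the kind of question a design document would need to settle.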
