Hello, Jim Apple.

For now, in my commit, field terminators cannot be set to extended ASCII
characters.

After using the statement "create table text_thorn_ecirc_newline(col1 string,
col2 string, col3 int, col4 int) row format delimited fields terminated by 'þ'
escaped by '-22' lines terminated by '\n';" to create the table, the result of
"describe extended text_thorn_ecirc_newline" is:


[nobida147:21000] > describe extended text_thorn_ecirc_newline;
Query: describe extended text_thorn_ecirc_newline
Query submitted at: 2016-08-03 10:57:11 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=fd4bf0a9154be6a7:b3f6dcabe9dea3ba
+------------------------------+------------------------------------------------------------------------------------+----------------------+
| name                         | type                                                                               | comment              |
+------------------------------+------------------------------------------------------------------------------------+----------------------+
| # col_name                   | data_type                                                                          | comment              |
|                              | NULL                                                                               | NULL                 |
| col1                         | string                                                                             | NULL                 |
| col2                         | string                                                                             | NULL                 |
| col3                         | int                                                                                | NULL                 |
| col4                         | int                                                                                | NULL                 |
|                              | NULL                                                                               | NULL                 |
| # Detailed Table Information | NULL                                                                               | NULL                 |
| Database:                    | multi_byte_test2                                                                   | NULL                 |
| Owner:                       | root                                                                               | NULL                 |
| CreateTime:                  | Wed Aug 03 10:55:25 CST 2016                                                       | NULL                 |
| LastAccessTime:              | UNKNOWN                                                                            | NULL                 |
| Protect Mode:                | None                                                                               | NULL                 |
| Retention:                   | 0                                                                                  | NULL                 |
| Location:                    | hdfs://localhost:20500/test-warehouse/multi_byte_test2.db/text_thorn_ecirc_newline | NULL                 |
| Table Type:                  | MANAGED_TABLE                                                                      | NULL                 |
| Table Parameters:            | NULL                                                                               | NULL                 |
|                              | transient_lastDdlTime                                                              | 1470192925           |
|                              | NULL                                                                               | NULL                 |
| # Storage Information        | NULL                                                                               | NULL                 |
| SerDe Library:               | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                 | NULL                 |
| InputFormat:                 | org.apache.hadoop.mapred.TextInputFormat                                           | NULL                 |
| OutputFormat:                | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                         | NULL                 |
| Compressed:                  | No                                                                                 | NULL                 |
| Num Buckets:                 | 0                                                                                  | NULL                 |
| Bucket Columns:              | []                                                                                 | NULL                 |
| Sort Columns:                | []                                                                                 | NULL                 |
| Storage Desc Params:         | NULL                                                                               | NULL                 |
|                              | escape.delim                                                                       | -22                  |
|                              | field.delim                                                                        | \u00FE               |
|                              | line.delim                                                                         | \n                   |
|                              | serialization.format                                                               | \u00FE               |
+------------------------------+------------------------------------------------------------------------------------+----------------------+
Fetched 32 row(s) in 0.10s

We can see that the field delimiter is correctly parsed as the extended ASCII
character with decimal value 254 (see the last three rows of the output above).
However, when running the query "select * from text_thorn_ecirc_newline", the
result is:

[nobida147:21000] > select * from text_thorn_ecirc_newline;
Query: select * from text_thorn_ecirc_newline
Query submitted at: 2016-08-03 11:01:01 (Coordinator: http://0.0.0.0:25000)
Query progress can be monitored at: http://0.0.0.0:25000/query_plan?query_id=fd494eef6abac951:972191dbc5e1bd94
+------------------+------+------+------+
| col1             | col2 | col3 | col4 |
+------------------+------+------+------+
| one�two�3�4      | NULL | NULL | NULL |
| one�one�two�3�4  | NULL | NULL | NULL |
| one��two�3�4     | NULL | NULL | NULL |
| one��one�two�3�4 | NULL | NULL | NULL |
| one���two�3�4    | NULL | NULL | NULL |
+------------------+------+------+------+
Fetched 5 row(s) in 0.44s

After debugging, I found that the value of field_delim_.size() is 2 (we expect
1) at
https://gerrit.cloudera.org/#/c/3314/7/be/src/exec/delimited-text-parser.cc@126.
The member function size() returns the length of the string in bytes, as
http://www.cplusplus.com/reference/string/string/size/ describes. If we could
get the "correct size" of the string here, I believe we could use extended
ASCII characters as well.
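
A minimal sketch of why size() can report 2 here, under two assumptions that the
logs above suggest but do not confirm: the delimiter reaches the parser encoded
as UTF-8 (U+00FE becomes the bytes 0xC3 0xBE), and the data file stores it as the
single byte 0xFE. The helper names and constants below are hypothetical and are
not part of Impala.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Impala: counts UTF-8 code points by
// skipping continuation bytes (bit pattern 10xxxxxx).
std::size_t Utf8CodePointCount(const std::string& s) {
  std::size_t count = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Hypothetical helper: counts non-overlapping occurrences of a non-empty
// delimiter in a row, comparing raw bytes as std::string::find does.
std::size_t CountDelimiters(const std::string& row, const std::string& delim) {
  assert(!delim.empty());
  std::size_t count = 0;
  for (std::size_t pos = row.find(delim); pos != std::string::npos;
       pos = row.find(delim, pos + delim.size())) {
    ++count;
  }
  return count;
}

// Assumed encodings of the delimiter, written as explicit bytes so the
// example does not depend on the source file's encoding.
const std::string kUtf8Thorn = "\xC3\xBE";  // U+00FE encoded as UTF-8
const std::string kLatin1Thorn = "\xFE";    // U+00FE as a single byte
```

Under these assumptions, kUtf8Thorn.size() is 2 while Utf8CodePointCount(kUtf8Thorn)
is 1, and a row whose fields are separated by the single byte 0xFE contains no
occurrence of the two-byte delimiter, which would be consistent with every row
ending up in col1.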


As for the tests of the two corner cases, I have posted my logs in IMPALA-3945.




------------------ Original Message ------------------
From: "jbapple";<[email protected]>;
Sent: Wednesday, August 3, 2016, 6:52 AM
To: "Yuanhao Luo"<[email protected]>; 
Cc: "dev@impala"<[email protected]>; 
Subject: Re: Re: Re: IMPALA-2428 Support multiple-character string as the field 
delimiter



Also, you asked

> I'm wondering whether have you ever test these two cases.

I do not know. Can you check and report back what you find?

On Tue, Aug 2, 2016 at 3:49 PM, Jim Apple <[email protected]> wrote:
>> What's more, in this patch, we can use only standard ASCII characters (with
>> decimal value from 0 to 127) in ASCII or octal format to set the field
>> terminator, but not extended ASCII characters (with decimal value from 128 to
>> 255) or standard ASCII characters in Unicode, decimal or hexadecimal format.
>
> Today, can field terminators be set to extended ASCII characters (not
> with octal, but raw)?
