Re: Re: Re： IMPALA-2428 Support multiple-character string as the field delimiter

Jim Apple Thu, 04 Aug 2016 10:49:44 -0700

Unfortunately, I have exceeded my time budget for working on an issue
of this complexity without a concise and clear design document.


I may be able to help later if a concise and clear design document
becomes available.

On Tue, Aug 2, 2016 at 8:41 PM, Yuanhao Luo
<[email protected]> wrote:
>
> Hello, Jim Apple.
>
> For now, in my commit, field terminators cannot be set to extended ASCII
> characters.
> After using the statement "create table text_thorn_ecirc_newline(col1 string,
> col2 string, col3 int, col4 int) row format delimited fields terminated by
> 'þ' escaped by '-22' lines terminated by '\n';" to create the table, the
> result of "describe extended text_thorn_ecirc_newline" is:
>
> [nobida147:21000] > describe extended text_thorn_ecirc_newline;
> Query: describe extended text_thorn_ecirc_newline
> Query submitted at: 2016-08-03 10:57:11 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at:
> http://0.0.0.0:25000/query_plan?query_id=fd4bf0a9154be6a7:b3f6dcabe9dea3ba
> +------------------------------+------------------------------------------------------------------------------------+----------------------+
> | name                         | type                                                                               | comment              |
> +------------------------------+------------------------------------------------------------------------------------+----------------------+
> | # col_name                   | data_type                                                                          | comment              |
> |                              | NULL                                                                               | NULL                 |
> | col1                         | string                                                                             | NULL                 |
> | col2                         | string                                                                             | NULL                 |
> | col3                         | int                                                                                | NULL                 |
> | col4                         | int                                                                                | NULL                 |
> |                              | NULL                                                                               | NULL                 |
> | # Detailed Table Information | NULL                                                                               | NULL                 |
> | Database:                    | multi_byte_test2                                                                   | NULL                 |
> | Owner:                       | root                                                                               | NULL                 |
> | CreateTime:                  | Wed Aug 03 10:55:25 CST 2016                                                       | NULL                 |
> | LastAccessTime:              | UNKNOWN                                                                            | NULL                 |
> | Protect Mode:                | None                                                                               | NULL                 |
> | Retention:                   | 0                                                                                  | NULL                 |
> | Location:                    | hdfs://localhost:20500/test-warehouse/multi_byte_test2.db/text_thorn_ecirc_newline | NULL                 |
> | Table Type:                  | MANAGED_TABLE                                                                      | NULL                 |
> | Table Parameters:            | NULL                                                                               | NULL                 |
> |                              | transient_lastDdlTime                                                              | 1470192925           |
> |                              | NULL                                                                               | NULL                 |
> | # Storage Information        | NULL                                                                               | NULL                 |
> | SerDe Library:               | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                 | NULL                 |
> | InputFormat:                 | org.apache.hadoop.mapred.TextInputFormat                                           | NULL                 |
> | OutputFormat:                | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                         | NULL                 |
> | Compressed:                  | No                                                                                 | NULL                 |
> | Num Buckets:                 | 0                                                                                  | NULL                 |
> | Bucket Columns:              | []                                                                                 | NULL                 |
> | Sort Columns:                | []                                                                                 | NULL                 |
> | Storage Desc Params:         | NULL                                                                               | NULL                 |
> |                              | escape.delim                                                                       | -22                  |
> |                              | field.delim                                                                        | \u00FE               |
> |                              | line.delim                                                                         | \n                   |
> |                              | serialization.format                                                               | \u00FE               |
> +------------------------------+------------------------------------------------------------------------------------+----------------------+
> Fetched 32 row(s) in 0.10s
>
> We can see that the field delimiter is correctly parsed as the extended ASCII
> character with decimal value 254 (the field.delim and serialization.format
> rows at the end of the output above both show \u00FE). However, when running
> the query "select * from text_thorn_ecirc_newline", the result is:
>
> [nobida147:21000] > select * from text_thorn_ecirc_newline;
> Query: select * from text_thorn_ecirc_newline
> Query submitted at: 2016-08-03 11:01:01 (Coordinator: http://0.0.0.0:25000)
> Query progress can be monitored at:
> http://0.0.0.0:25000/query_plan?query_id=fd494eef6abac951:972191dbc5e1bd94
> +------------------+------+------+------+
> | col1             | col2 | col3 | col4 |
> +------------------+------+------+------+
> | one�two�3�4      | NULL | NULL | NULL |
> | one�one�two�3�4  | NULL | NULL | NULL |
> | one��two�3�4     | NULL | NULL | NULL |
> | one��one�two�3�4 | NULL | NULL | NULL |
> | one���two�3�4    | NULL | NULL | NULL |
> +------------------+------+------+------+
> Fetched 5 row(s) in 0.44s
>
> After debugging, I found that the value of field_delim_.size() is 2 (we
> expect 1) in
> https://gerrit.cloudera.org/#/c/3314/7/be/src/exec/delimited-text-parser.cc@126.
> The member function size() returns the number of bytes in the string, as
> http://www.cplusplus.com/reference/string/string/size/ explains. If we
> could get the character count of the string here rather than its byte
> count, I believe we could use extended ASCII characters as well.
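
The size() observation above can be illustrated with a minimal sketch. It is
not Impala code: CountCodePoints and SplitOnDelimiter are hypothetical helper
names chosen for illustration. It also rests on an assumption the thread does
not state, namely that the data file stores the delimiter as the single byte
0xFE while the delimiter string read from the table metadata holds the
two-byte UTF-8 encoding of U+00FE (0xC3 0xBE).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t CountCodePoints(const std::string& s) {
  std::size_t count = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Hypothetical helper (not the Impala parser): splits `line` on every exact
// byte-wise occurrence of `delim`.
std::vector<std::string> SplitOnDelimiter(const std::string& line,
                                          const std::string& delim) {
  std::vector<std::string> fields;
  if (delim.empty()) {
    // An empty delimiter would match everywhere; treat the line as one field.
    fields.push_back(line);
    return fields;
  }
  std::size_t start = 0;
  std::size_t pos = line.find(delim, start);
  while (pos != std::string::npos) {
    fields.push_back(line.substr(start, pos - start));
    start = pos + delim.size();
    pos = line.find(delim, start);
  }
  fields.push_back(line.substr(start));
  return fields;
}
```

With these helpers, std::string("\xC3\xBE").size() is 2 while
CountCodePoints("\xC3\xBE") is 1. Under the assumption above, splitting a row
whose fields are separated by the single byte 0xFE on the two-byte delimiter
finds no match and yields one field, which would be consistent with the whole
line landing in col1. Splitting the same row on "\xFE" yields four fields.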
>
> As for the tests of the two corner cases, I have posted my logs in IMPALA-3945.
>
>
> ------------------ Original Message ------------------
> From: "jbapple";<[email protected]>;
> Sent: Wednesday, August 3, 2016, 6:52 AM
> To: "Yuanhao Luo"<[email protected]>;
> Cc: "dev@impala"<[email protected]>;
> Subject: Re: Re: Re： IMPALA-2428 Support multiple-character string as the field
> delimiter
>
> Also, you asked
>
>> I'm wondering whether you have ever tested these two cases.
>
> I do not know. Can you check and report back what you find?
>
> On Tue, Aug 2, 2016 at 3:49 PM, Jim Apple <[email protected]> wrote:
>>> What's more, in this patch, we can use only standard ASCII characters
>>> (with decimal values from 0 to 127) in ASCII or octal format to set the
>>> field terminator, but not extended ASCII characters (with decimal values
>>> from 128 to 255) or standard ASCII characters in Unicode, decimal, or
>>> hexadecimal format.
>>
>> Today, can field terminators be set to extended ASCII characters (not
>> with octal, but raw)?
