[jira] [Commented] (HIVE-9201) Lazy functions do not handle newlines and carriage returns properly

Yongzhi Chen (JIRA) Tue, 23 Dec 2014 15:43:55 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-9201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257688#comment-14257688
 ]


Yongzhi Chen commented on HIVE-9201:
------------------------------------

Three rows are returned because the Hadoop method 
org.apache.hadoop.mapred.LineRecordReader.readDefaultLine treats both \r and \n 
as line terminators, so Hive needs to process \r and \n characters before that 
method is called. The map job uses the LazyUtils.writeEscaped method to escape 
special characters (such as control characters). The method simply adds the 
escape character before each character that needs escaping. There are two 
issues. First, \r and \n are not among the characters that need to be escaped. 
Second, even if they were added, they must be escaped differently: just adding 
the escape character (such as \ ) before them does not solve the problem, 
because the bytes with values 13 and 10 are still in the stream. For example, 
'\r' should be replaced with two characters: the escape character and the 
character 'r'. This logic can be added to the LazyUtils.writeEscaped method. 
The processed stream can then pass through the 
org.apache.hadoop.mapred.LineRecordReader.readDefaultLine method without logic 
errors (such as one row becoming 3 rows). Then, in the LazyString.init method, 
when we remove the escape characters, we know to convert the escape character 
followed by 'r' back to char 13.
I have attached the fix patch.
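
The escaping scheme described above can be sketched roughly as follows. This is 
only an illustration, not the attached patch: the class NewlineEscapeSketch and 
its methods escape, unescape and countLines are hypothetical names, 
java.io.BufferedReader stands in for Hadoop's line reader (it also treats \r 
and \n as line terminators), and escaping the escape character itself is an 
assumption made here so that the round trip is unambiguous.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

public class NewlineEscapeSketch {

    // Hypothetical helper, NOT Hive's LazyUtils.writeEscaped.
    // Replaces byte 13 with <escape>'r' and byte 10 with <escape>'n',
    // so neither raw byte remains in the output.
    static byte[] escape(byte[] input, byte escapeChar) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : input) {
            if (b == '\r') {
                out.write(escapeChar);
                out.write('r');
            } else if (b == '\n') {
                out.write(escapeChar);
                out.write('n');
            } else if (b == escapeChar) {
                // Assumption: the escape char itself is doubled.
                out.write(escapeChar);
                out.write(escapeChar);
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    // Hypothetical inverse, standing in for the unescape step
    // that the comment places in LazyString.init.
    static byte[] unescape(byte[] input, byte escapeChar) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < input.length; i++) {
            byte b = input[i];
            if (b == escapeChar && i + 1 < input.length) {
                byte next = input[++i];
                if (next == 'r') {
                    out.write('\r');
                } else if (next == 'n') {
                    out.write('\n');
                } else {
                    out.write(next);
                }
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    // Counts the records a line-oriented reader would see.
    static long countLines(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return new BufferedReader(new StringReader(text)).lines().count();
    }

    public static void main(String[] args) {
        byte esc = (byte) '\\';
        byte[] raw = "a\rb\nc".getBytes(StandardCharsets.ISO_8859_1);
        byte[] escaped = escape(raw, esc);
        System.out.println("lines in raw value:     " + countLines(raw));
        System.out.println("lines in escaped value: " + countLines(escaped));
        System.out.println("round trip intact:      "
                + java.util.Arrays.equals(raw, unescape(escaped, esc)));
    }
}
```

The point of the sketch is that simply prefixing \r with the escape character 
would leave byte 13 in the stream, whereas substituting the letter 'r' removes 
it, so a line-oriented reader no longer splits the value.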


> Lazy functions do not handle newlines and carriage returns properly
> -------------------------------------------------------------------
>
>                 Key: HIVE-9201
>                 URL: https://issues.apache.org/jira/browse/HIVE-9201
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Yongzhi Chen
>            Assignee: Yongzhi Chen
>
> Hive returns wrong results when a returned string contains \r or \n.
> This happens when the query triggers MapReduce jobs.
> For example, for a table named strsim with only one row, query 1 returns
> 1 row while query 2 returns 3 rows, as shown below.
> Query 1:
> select "abc", narray from strsim LATERAL VIEW explode(array(1)) C AS narray;
> Query 2:
> select "a\rb\nc", narray from strsim LATERAL VIEW explode(array(1)) C AS 
> narray;
> select "abc", narray from strsim LATERAL VIEW explode(array(1)) C AS narray;
> INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
> INFO  : Job running in-process (local Hadoop)
> INFO  : 2014-12-23 15:00:08,958 Stage-1 map = 0%,  reduce = 0%
> INFO  : Ended Job = job_local1178499218_0015
> +------+---------+--+
> | _c0  | narray  |
> +------+---------+--+
> | abc  | 1       |
> +------+---------+--+
> 1 row selected (1.283 seconds)
> select "a\rb\nc", narray from strsim LATERAL VIEW explode(array(1)) C AS narray;
> INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
> INFO  : Job running in-process (local Hadoop)
> INFO  : 2014-12-23 15:04:35,441 Stage-1 map = 0%,  reduce = 0%
> INFO  : Ended Job = job_local1816711099_0016
> +------+---------+--+
> | _c0  | narray  |
> +------+---------+--+
> | a    | NULL    |
> | b    | NULL    |
> | c    | 1       |
> +------+---------+--+
> 3 rows selected (1.135 seconds)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
