[ 
https://issues.apache.org/jira/browse/HIVE-11785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aihua Xu updated HIVE-11785:
----------------------------
    Release Note: This change, together with HIVE-12820, adds support for 
carriage return and new line characters in fields. Before this change, users 
had to preprocess the text, replacing those characters with something other 
than carriage return and new line, for the files to be processed properly. 
With this change, they are escaped automatically if the 
{{serialization.escape.crlf}} serde property is set to true. One incompatible 
change: the characters 'r' and 'n' can no longer be used as a separator or field 
delimiter   (was: This change disallows carriage return and new line characters 
to be used as field separators or escape character. While before this change, 
those were allowed while those cases could easily lead to incorrect results if 
the content also contain carriage return or new line. Since even carriage 
return or new line was escaped, line based input format in MapReduce used in 
Hive will break the lines by carriage return and new line only and lead to 
incorrect result.)
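
The escaping described in the release note above can be illustrated with a 
small standalone sketch. This is not Hive's implementation. It assumes a 
backslash escape character, a tab field delimiter, and that carriage return 
and new line are written as the escape character followed by 'r' or 'n' 
(which would explain why 'r' and 'n' can no longer serve as delimiters). All 
function names are hypothetical.

```python
ESCAPE = "\\"     # assumed escape character
FIELD_SEP = "\t"  # assumed field delimiter


def escape_field(value: str) -> str:
    """Escape CR, LF, the escape character and the delimiter in one field."""
    out = []
    for ch in value:
        if ch == "\r":
            out.append(ESCAPE + "r")
        elif ch == "\n":
            out.append(ESCAPE + "n")
        elif ch in (ESCAPE, FIELD_SEP):
            out.append(ESCAPE + ch)
        else:
            out.append(ch)
    return "".join(out)


def unescape_field(value: str) -> str:
    """Reverse escape_field."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == ESCAPE and i + 1 < len(value):
            nxt = value[i + 1]
            out.append({"r": "\r", "n": "\n"}.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def split_fields(line: str) -> list:
    """Split a line on unescaped delimiters, keeping escape pairs intact."""
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == ESCAPE and i + 1 < len(line):
            current.append(line[i:i + 2])
            i += 2
        elif ch == FIELD_SEP:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


def write_rows(rows) -> str:
    """Serialize rows as one escaped, newline-terminated line per row."""
    return "".join(
        FIELD_SEP.join(escape_field(str(f)) for f in row) + "\n" for row in rows
    )


def read_rows(text: str) -> list:
    """Read rows back the way a line-based reader would: split on newline."""
    lines = text.split("\n")[:-1]  # drop the empty piece after the last "\n"
    return [[unescape_field(f) for f in split_fields(line)] for line in lines]


# Illustrative rows modelled on the expected result in the issue description.
rows = [(1, "newline\nhere"), (2, "carriage return\rhere"), (3, "both\r\nhere")]
encoded = write_rows(rows)
decoded = read_rows(encoded)
```

With the fields escaped, each row occupies exactly one physical line, so a 
reader that splits on line breaks sees three records instead of the six shown 
in the MapReduce output of the issue description.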

> Support escaping carriage return and new line for LazySimpleSerDe
> -----------------------------------------------------------------
>
>                 Key: HIVE-11785
>                 URL: https://issues.apache.org/jira/browse/HIVE-11785
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 2.0.0
>            Reporter: Aihua Xu
>            Assignee: Aihua Xu
>              Labels: TODOC2.0
>             Fix For: 2.0.0
>
>         Attachments: HIVE-11785.2.patch, HIVE-11785.3.patch, 
> HIVE-11785.patch, test.parquet
>
>
> Create the table and run the queries as follows. The results differ 
> depending on the value of hive.fetch.task.conversion. 
> The expected result is:
> {noformat}
> 1     newline
> here
> 2     carriage return
> 3     both
> here
> {noformat}
> {noformat}
> hive> create table repo (lvalue int, charstring string) stored as parquet;
> OK
> Time taken: 0.34 seconds
> hive> load data inpath '/tmp/repo/test.parquet' overwrite into table repo;
> Loading data to table default.repo
> chgrp: changing ownership of 
> 'hdfs://nameservice1/user/hive/warehouse/repo/test.parquet': User does not 
> belong to hive
> Table default.repo stats: [numFiles=1, numRows=0, totalSize=610, 
> rawDataSize=0]
> OK
> Time taken: 0.732 seconds
> hive> set hive.fetch.task.conversion=more;
> hive> select * from repo;
> OK
> 1     newline
> here
> here  carriage return
> 3     both
> here
> Time taken: 0.253 seconds, Fetched: 3 row(s)
> hive> set hive.fetch.task.conversion=none;
> hive> select * from repo;
> Query ID = root_20150909113535_e081db8b-ccd9-4c44-aad9-d990ffb8edf3
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1441752031022_0006, Tracking URL = 
> http://host-10-17-81-63.coe.cloudera.com:8088/proxy/application_1441752031022_0006/
> Kill Command = 
> /opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/hadoop/bin/hadoop job  
> -kill job_1441752031022_0006
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
> 2015-09-09 11:35:54,127 Stage-1 map = 0%,  reduce = 0%
> 2015-09-09 11:36:04,664 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.98 
> sec
> MapReduce Total cumulative CPU time: 2 seconds 980 msec
> Ended Job = job_1441752031022_0006
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 1   Cumulative CPU: 2.98 sec   HDFS Read: 4251 HDFS 
> Write: 51 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 980 msec
> OK
> 1     newline
> NULL  NULL
> 2     carriage return
> NULL  NULL
> 3     both
> NULL  NULL
> Time taken: 25.131 seconds, Fetched: 6 row(s)
> hive>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
