[jira] [Created] (HIVE-6428) concat_ws mangles non-ASCII characters

Nick Dimiduk (JIRA) Thu, 13 Feb 2014 15:06:59 -0800

Nick Dimiduk created HIVE-6428:
----------------------------------

             Summary: concat_ws mangles non-ASCII characters
                 Key: HIVE-6428
                 URL: https://issues.apache.org/jira/browse/HIVE-6428
             Project: Hive
          Issue Type: Bug
          Components: UDF
    Affects Versions: 0.13.0
            Reporter: Nick Dimiduk
            Priority: Critical
         Attachments: ru.txt


Marked critical because this results in data loss from using built-in 
functionality. I think the issue is {{concat_ws}}, though I suppose it could be 
the VIEW as well.

Hive is losing the distinction between non-ASCII characters, folding distinct 
values into the same value. Here are steps to reproduce, and I've attached a 
small sample containing 3 distinct lines from the larger input file.

Grab sample data, confirm the number of total records and the number of unique 
combinations of the first two columns match.
{noformat}
$ mkdir /tmp/pagecounts
$ cd /tmp/pagecounts
$ wget 
http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz
$ gzcat pagecounts-20081001-000000.gz | cut -d\  -f1,2 | wc -l
 4268675
$ gzcat pagecounts-20081001-000000.gz | cut -d\  -f1,2 | sort | uniq | wc -l
 4268675
{noformat}

Create hive table over input data.
{noformat}
CREATE EXTERNAL TABLE pagecounts (projectcode STRING, pagename STRING, 
pageviews STRING, bytes STRING)
ROW FORMAT
  DELIMITED FIELDS TERMINATED BY ' '
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/pagecounts/';
{noformat}

confirm number of unique combinations of the first two columns
{noformat}
SELECT count(DISTINCT projectcode, pagename) FROM pagecounts;
=> 4268675
{noformat}

Create a view over the raw data, concatenating first two columns. Distinct 
count does not match.
{noformat}
CREATE VIEW pgc_simple (rowkey, pageviews, bytes) AS
SELECT concat_ws('/', projectcode, pagename), pageviews, bytes
FROM pagecounts;

SELECT count(DISTINCT rowkey) FROM pgc_simple;
=> 4268561
{noformat}

Perform same "view" from shell. distinct count is retained.
{noformat}
$ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | wc -l
 4268675
$ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | sort | uniq | 
wc -l                                                                           
                            
 4268675
{noformat}

Look at some data.
{noformat}
$ hive -e "SELECT i.* FROM (SELECT rowkey, count(*) AS nbr FROM pgc_simple 
GROUP BY rowkey) i WHERE i.nbr > 1;" | tail
OK
Time taken: 40.493 seconds, Fetched: 37 row(s)
ru/?�N  2
ru/�    3
zh/?��  2
zh/Category:��  2
zh/Wikipedia:�� 5
zh/�    2
zh/��   11
zh/��˹��        2
zh/���� 17
zh/������       15
{noformat}

Choose 2nd line of output to inspect on the shell. My locale isn't able to find 
a character for the codepoints, but sort | uniq identify them as different.
{noformat}
$ gzcat pagecounts-20081001-000000.gz | egrep '^ru \?. ' | cut -d\  -f1,2 | 
sort | uniq
ru ??
ru ??
ru ??
{noformat}

Print them as C-escape codes. They are indeed distinct.
{noformat}
$ cat pagecounts-20081001-000000 | egrep '^ru \?. ' | cut -d\  -f1,2 | od -c
0000000    r   u       ? 240  \n   r   u       ? 244  \n   r   u       ?
0000020  247  \n
0000022
{noformat}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Created] (HIVE-6428) concat_ws mangles non-ASCII characters

Reply via email to