Nick Dimiduk created HIVE-6428: ---------------------------------- Summary: concat_ws mangles non-ASCII characters Key: HIVE-6428 URL: https://issues.apache.org/jira/browse/HIVE-6428 Project: Hive Issue Type: Bug Components: UDF Affects Versions: 0.13.0 Reporter: Nick Dimiduk Priority: Critical Attachments: ru.txt
Marked critical because this results in data loss from using built-in functionality. I think the issue is {{concat_ws}}, though I suppose it could be the VIEW as well. Hive is losing the distinction between non-ASCII characters, folding distinct values into the same value. Here are steps to reproduce, and I've attached a small sample containing 3 distinct lines from the larger input file. Grab sample data, confirm the number of total records and the number of unique combinations of the first two columns match. {noformat} $ mkdir /tmp/pagecounts $ cd /tmp/pagecounts $ wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz $ gzcat pagecounts-20081001-000000.gz | cut -d\ -f1,2 | wc -l 4268675 $ gzcat pagecounts-20081001-000000.gz | cut -d\ -f1,2 | sort | uniq | wc -l 4268675 {noformat} Create hive table over input data. {noformat} CREATE EXTERNAL TABLE pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/tmp/pagecounts/'; {noformat} confirm number of unique combinations of the first two columns {noformat} SELECT count(DISTINCT projectcode, pagename) FROM pagecounts; => 4268675 {noformat} Create a view over the raw data, concatenating first two columns. Distinct count does not match. {noformat} CREATE VIEW pgc_simple (rowkey, pageviews, bytes) AS SELECT concat_ws('/', projectcode, pagename), pageviews, bytes FROM pagecounts; SELECT count(DISTINCT rowkey) FROM pgc_simple; => 4268561 {noformat} Perform same "view" from shell. distinct count is retained. {noformat} $ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | wc -l 4268675 $ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | sort | uniq | wc -l 4268675 {noformat} Look at some data. {noformat} $ hive -e "SELECT i.* FROM (SELECT rowkey, count(*) AS nbr FROM pgc_simple GROUP BY rowkey) i WHERE i.nbr > 1;" | tail OK Time taken: 40.493 seconds, Fetched: 37 row(s) ru/?�N 2 ru/� 3 zh/?�� 2 zh/Category:�� 2 zh/Wikipedia:�� 5 zh/� 2 zh/�� 11 zh/��˹�� 2 zh/���� 17 zh/������ 15 {noformat} Choose 2nd line of output to inspect on the shell. My locale isn't able to find a character for the codepoints, but sort | uniq identify them as different. {noformat} $ gzcat pagecounts-20081001-000000.gz | egrep '^ru \?. ' | cut -d\ -f1,2 | sort | uniq ru ?? ru ?? ru ?? {noformat} Print them as C-escape codes. They are indeed distinct. {noformat} $ cat pagecounts-20081001-000000 | egrep '^ru \?. ' | cut -d\ -f1,2 | od -c 0000000 r u ? 240 \n r u ? 244 \n r u ? 0000020 247 \n 0000022 {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)