I am trying to load a text file that uses ASCII 254 (the byte 0xFE, which shows up as 'þ' below) as its field delimiter.
Column #1 is the time
Column #2 is the ID
Contents of the file (test_data.txt):
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
I create a table in Hive with the following syntax:
hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '254' STORED AS TEXTFILE;
OK
Time taken: 8.891 seconds
After this, I load the file:
hive> LOAD DATA INPATH 'hdfs://localhost:54310/test_data.txt'
OVERWRITE INTO TABLE test;
Loading data to table test
OK
Time taken: 0.623 seconds
To make sure the data was loaded correctly, I select only the first
column (time):
hive> select time from test;
Total MapReduce jobs = 1
Starting Job = job_200901141536_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200901141536_0006
Kill Command = /home/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_200901141536_0006
map = 0%, reduce =0%
map = 50%, reduce =0%
map = 100%, reduce =0%
Ended Job = job_200901141536_0006
OK
11-25-2008-14:37:44�546460819054541434
11-25-2008-15:05:34�546460554662946399
11-25-2008-15:06:21�556460862552067034
11-25-2008-15:17:51�576460863857903582
11-25-2008-14:37:44�546460819054541434
11-25-2008-15:05:34�546460554662946399
11-25-2008-15:06:21�556460862552067034
11-25-2008-15:17:51�576460863857903582
11-25-2008-14:37:44�546460819054541434
11-25-2008-15:05:34�546460554662946399
11-25-2008-15:06:21�556460862552067034
11-25-2008-15:17:51�576460863857903582
11-25-2008-14:37:44�546460819054541434
11-25-2008-15:05:34�546460554662946399
11-25-2008-15:06:21�556460862552067034
11-25-2008-15:17:51�576460863857903582
Time taken: 12.25 seconds
Clearly, the columns were not parsed correctly.
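I am not sure whether the 0xFE byte is getting mangled on its way into HDFS or whether Hive is simply not splitting on it. One check I can think of (just a sketch, with a throwaway table name, after putting a fresh copy of the file back into HDFS, since LOAD DATA INPATH moves it) is to pull each line back out as a single string column:
hive> CREATE TABLE test_raw(raw_line STRING) STORED AS TEXTFILE;
hive> LOAD DATA INPATH 'hdfs://localhost:54310/test_data.txt' INTO TABLE test_raw;
hive> SELECT raw_line FROM test_raw;
Since the default Ctrl-A delimiter never appears in this data, each row should come back as the complete line, which would show whether the 254 byte survived the load intact.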
To narrow the problem down, I created a version of this file where the
delimiter is a tab.
Contents of this file (test_data_tab.txt):
11-25-2008-15:05:34 546460554662946399
11-25-2008-15:06:21 556460862552067034
11-25-2008-15:17:51 576460863857903582
11-25-2008-14:37:44 546460819054541434
11-25-2008-15:05:34 546460554662946399
11-25-2008-15:06:21 556460862552067034
11-25-2008-15:17:51 576460863857903582
11-25-2008-14:37:44 546460819054541434
11-25-2008-15:05:34 546460554662946399
11-25-2008-15:06:21 556460862552067034
11-25-2008-15:17:51 576460863857903582
11-25-2008-14:37:44 546460819054541434
11-25-2008-15:05:34 546460554662946399
11-25-2008-15:06:21 556460862552067034
11-25-2008-15:17:51 576460863857903582
I use nearly identical syntax to create the new table. The only change
is the field delimiter, which is now '9' (tab):
hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '9' STORED AS TEXTFILE;
OK
Time taken: 0.105 seconds
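As an aside, I assume the tab could also be written with the usual escape rather than the numeric code, along these lines:
hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
but I kept the numeric form so that the two tests differ only in the number.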
Next, I load the file:
hive> LOAD DATA INPATH 'hdfs://localhost:54310/test_data_tab.txt'
OVERWRITE INTO TABLE test;
Loading data to table test
OK
Time taken: 0.179 seconds
I check that the data has been loaded, using the same query:
hive> select time from test;
Total MapReduce jobs = 1
Starting Job = job_200901141536_0007, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200901141536_0007
Kill Command = /home/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_200901141536_0007
map = 0%, reduce =0%
map = 50%, reduce =0%
map = 100%, reduce =0%
Ended Job = job_200901141536_0007
OK
11-25-2008-15:05:34
11-25-2008-15:06:21
11-25-2008-15:17:51
11-25-2008-14:37:44
11-25-2008-15:05:34
11-25-2008-15:06:21
11-25-2008-15:17:51
11-25-2008-15:05:34
11-25-2008-15:06:21
11-25-2008-15:17:51
11-25-2008-14:37:44
11-25-2008-15:05:34
11-25-2008-15:06:21
11-25-2008-15:17:51
11-25-2008-14:37:44
Time taken: 10.553 seconds
Success: the import with ASCII 9 worked, but ASCII 254 failed. Is
there any way to load the original data file with its ASCII 254 delimiter?
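One thing I wondered about, but have not been able to confirm, is whether Hive reads a numeric delimiter as a signed byte. 254 is out of range for a signed byte, but -2 has the same bit pattern (0xFE), so perhaps this spelling would work (just a guess on my part):
hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2' STORED AS TEXTFILE;
If that is off base, is there an escape syntax for specifying the byte 0xFE directly? Any pointers would be appreciated.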