The problem here is that bytes are signed in Java, so I think we are limited to ASCII values 0 to 127 for delimiters. Please file a JIRA for this and I can get a fix out to you to handle this case.
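A quick sketch of where the sign bit bites (this is my reading of how the delimiter comparison goes wrong, illustrative rather than the exact code path in MetadataTypedColumnsetSerDe, and the class name is just for illustration):

    public class SignedByteDemo {
        public static void main(String[] args) {
            int delim = Integer.parseInt("254");  // delimiter as stored in field.delim
            byte raw = (byte) 0xFE;               // the 0xFE byte as read from the file: -2 as a signed Java byte
            System.out.println(raw == delim);           // false: raw widens to -2, not 254
            System.out.println((raw & 0xFF) == delim);  // true once masked back to its unsigned value
        }
    }

Anything in 0-127 survives the byte-to-int round trip unchanged, which is why tab (9) works and 254 does not.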
Ashish
________________________________________
From: Eric Gottschalk [egottsch...@gmail.com]
Sent: Wednesday, January 14, 2009 1:50 PM
To: hive-user@hadoop.apache.org
Subject: Re: Problem with delimiter

The extended table description for table #1:

hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '254' STORED AS TEXTFILE;
OK
Time taken: 0.089 seconds
hive> describe extended test;
OK
time    string
id      string
Detailed Table Information:
Table(tableName:test,dbName:default,owner:hadoop,createTime:1231969614,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:time,type:string,comment:null),FieldSchema(name:id,type:string,comment:null)],location:hdfs://localhost:54310/test,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=254,field.delim=254}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})
Time taken: 0.105 seconds

For good measure, the table with the tab:

hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '9' STORED AS TEXTFILE;
OK
Time taken: 0.124 seconds
hive> describe extended test;
OK
time    string
id      string
Detailed Table Information:
Table(tableName:test,dbName:default,owner:hadoop,createTime:1231969673,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:time,type:string,comment:null),FieldSchema(name:id,type:string,comment:null)],location:hdfs://localhost:54310/test,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=9,field.delim=9}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})
Time taken: 0.087 seconds

On Jan 14, 2009, at 1:43 PM, Ashish Thusoo wrote:

> Can you do a
>
> describe extended test
>
> after you create it with the first create table statement and send out the output.
>
> Ashish
> ________________________________________
> From: Eric Gottschalk [egottsch...@gmail.com]
> Sent: Wednesday, January 14, 2009 1:38 PM
> To: hive-user@hadoop.apache.org
> Subject: Problem with delimiter
>
> I am trying to load a text file that uses ascii(254) as the delimiter.
> Column #1 is time
> Column #2 is ID
>
> Contents of this file (test_data.txt):
> 11-25-2008-14:37:44þ546460819054541434
> 11-25-2008-15:05:34þ546460554662946399
> 11-25-2008-15:06:21þ556460862552067034
> 11-25-2008-15:17:51þ576460863857903582
> 11-25-2008-14:37:44þ546460819054541434
> 11-25-2008-15:05:34þ546460554662946399
> 11-25-2008-15:06:21þ556460862552067034
> 11-25-2008-15:17:51þ576460863857903582
> 11-25-2008-14:37:44þ546460819054541434
> 11-25-2008-15:05:34þ546460554662946399
> 11-25-2008-15:06:21þ556460862552067034
> 11-25-2008-15:17:51þ576460863857903582
> 11-25-2008-14:37:44þ546460819054541434
> 11-25-2008-15:05:34þ546460554662946399
> 11-25-2008-15:06:21þ556460862552067034
> 11-25-2008-15:17:51þ576460863857903582
>
> I create a table in hive with the following syntax:
>
> hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '254' STORED AS TEXTFILE;
> OK
> Time taken: 8.891 seconds
>
> After this, I load the file:
>
> hive> LOAD DATA INPATH 'hdfs://localhost:54310/test_data.txt' OVERWRITE INTO TABLE test;
> Loading data to table test
> OK
> Time taken: 0.623 seconds
>
> To make sure that the data is loaded correctly, I select only the first column (time):
>
> hive> select time from test;
> Total MapReduce jobs = 1
> Starting Job = job_200901141536_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200901141536_0006
> Kill Command = /home/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_200901141536_0006
> map = 0%, reduce =0%
> map = 50%, reduce =0%
> map = 100%, reduce =0%
> Ended Job = job_200901141536_0006
> OK
> 11-25-2008-14:37:44�546460819054541434
> 11-25-2008-15:05:34�546460554662946399
> 11-25-2008-15:06:21�556460862552067034
> 11-25-2008-15:17:51�576460863857903582
> 11-25-2008-14:37:44�546460819054541434
> 11-25-2008-15:05:34�546460554662946399
> 11-25-2008-15:06:21�556460862552067034
> 11-25-2008-15:17:51�576460863857903582
> 11-25-2008-14:37:44�546460819054541434
> 11-25-2008-15:05:34�546460554662946399
> 11-25-2008-15:06:21�556460862552067034
> 11-25-2008-15:17:51�576460863857903582
> 11-25-2008-14:37:44�546460819054541434
> 11-25-2008-15:05:34�546460554662946399
> 11-25-2008-15:06:21�556460862552067034
> 11-25-2008-15:17:51�576460863857903582
> Time taken: 12.25 seconds
>
> It is obvious it did not parse the columns correctly.
>
> To narrow the problem down, I created a version of this file where the delimiter is a tab.
>
> Contents of this file (test_data_tab.txt):
> 11-25-2008-15:05:34	546460554662946399
> 11-25-2008-15:06:21	556460862552067034
> 11-25-2008-15:17:51	576460863857903582
> 11-25-2008-14:37:44	546460819054541434
> 11-25-2008-15:05:34	546460554662946399
> 11-25-2008-15:06:21	556460862552067034
> 11-25-2008-15:17:51	576460863857903582
> 11-25-2008-14:37:44	546460819054541434
> 11-25-2008-15:05:34	546460554662946399
> 11-25-2008-15:06:21	556460862552067034
> 11-25-2008-15:17:51	576460863857903582
> 11-25-2008-14:37:44	546460819054541434
> 11-25-2008-15:05:34	546460554662946399
> 11-25-2008-15:06:21	556460862552067034
> 11-25-2008-15:17:51	576460863857903582
>
> I use nearly identical syntax to create the new table.
> The only change is the new field delimiter, '9' (tab):
>
> hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '9' STORED AS TEXTFILE;
> OK
> Time taken: 0.105 seconds
>
> Next I load the file:
>
> hive> LOAD DATA INPATH 'hdfs://localhost:54310/test_data_tab.txt' OVERWRITE INTO TABLE test;
> Loading data to table test
> OK
> Time taken: 0.179 seconds
>
> Check to see if the data has been loaded, using the same query:
>
> hive> select time from test;
> Total MapReduce jobs = 1
> Starting Job = job_200901141536_0007, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200901141536_0007
> Kill Command = /home/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_200901141536_0007
> map = 0%, reduce =0%
> map = 50%, reduce =0%
> map = 100%, reduce =0%
> Ended Job = job_200901141536_0007
> OK
> 11-25-2008-15:05:34
> 11-25-2008-15:06:21
> 11-25-2008-15:17:51
> 11-25-2008-14:37:44
> 11-25-2008-15:05:34
> 11-25-2008-15:06:21
> 11-25-2008-15:17:51
> 11-25-2008-15:05:34
> 11-25-2008-15:06:21
> 11-25-2008-15:17:51
> 11-25-2008-14:37:44
> 11-25-2008-15:05:34
> 11-25-2008-15:06:21
> 11-25-2008-15:17:51
> 11-25-2008-14:37:44
> Time taken: 10.553 seconds
>
> Success. The import with ascii(9) worked, but ascii(254) failed. Is there any way to load this original data file with the ascii(254) delimiter?
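Until the fix goes in, one workaround is to rewrite the 0xFE bytes to tab before putting the file in HDFS and then load it with FIELDS TERMINATED BY '9', exactly as in your tab test. A rough sketch (the class name is made up, and it assumes the data itself contains no literal tabs or stray 0xFE bytes of its own):

    import java.io.*;

    // One-off preprocessor: copy a file, replacing every 0xFE byte with a tab (0x09).
    public class FixDelim {
        public static void main(String[] args) throws IOException {
            InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
            OutputStream out = new BufferedOutputStream(new FileOutputStream(args[1]));
            int b;
            // read() returns each byte widened to an unsigned int (0-255),
            // so comparing against 0xFE here is safe from the signed-byte issue.
            while ((b = in.read()) != -1) {
                out.write(b == 0xFE ? 0x09 : b);
            }
            out.close();
            in.close();
        }
    }

Run it as "java FixDelim test_data.txt test_data_fixed.txt" and load the result with the tab-delimited table definition above.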