On Wed, Sep 22, 2010 at 12:08 AM, Tianqiang Li <peter...@gmail.com> wrote:
> Hi,
> I have customized InputFormat class to read our log format in our hadoop job
> and Pig, which is built on top of Hadoop 0.20 api, now I'd like to re-use
> this inputformat to load data into Hive table by specifying InputFormat, and
> a Serde when I create a table like below:
>
> CREATE TABLE rawlog_test (
> user_id STRING,
> tag STRING,
> my_timestamp STRING )
> ROW FORMAT SERDE 'x.y.z.mySerDe'
> STORED AS INPUTFORMAT 'x.y.z.myInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat' ;
>
> Then I run:
> load data inpath '/rawlog.txt' into table rawlog_test;
>
> No error show up on screen but I found the deserialize function never got
> called. An when I use select * from rawlog_test; An error was threw out:
>
> FAILED: Error in semantic analysis: line 1:14 Input Format must implement
> InputFormat rawlog_test
>
> I search this on internet, found this might be related to Hive using old
> api(0.17) of InputFormat, does anybody know are there a way to get 0.20api
> worked on Hive? Adapt my code to old api need lots of work, and even if I
> get it done, maintaining two version of code sounds like a bit unnecessary,
> ( Pig 0.7 works well with my v0.20 of InputFormat, we need to use Pig and
> Hive at different situations. ) , are there any way that I can work around
> this? My version of Hive is 0.7, and hadoop is 0.20.1 from CDH2. Thanks.
>
> Regards,
> Peter
You can make a 0.20 InputFormat work with Hive, but it's a real PITA. The HBase and Cassandra handlers both do it. Essentially, you have to extend the new mapreduce InputFormat and then implement the methods of the old one, using final variables and chained method calls. Example here:
https://issues.apache.org/jira/secure/attachment/12452140/hive-1434-4-patch.txt

If your input format is simple enough, it is likely easier to write two separate classes, one for the old API and one for the new. Use the mapred.* InputFormat with Hive.
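
If you go the two-class route, one way to keep the duplication small is to put the actual line parsing in a plain class with no Hadoop imports and have both readers delegate to it. A rough sketch follows. The names RawLogRecord and RawLogParser are made up for illustration, and the tab-separated layout (user_id, tag, timestamp) is only an assumption, since the real log format was not posted.

```java
import java.util.Optional;

/** Hypothetical record holding the three columns of rawlog_test. */
final class RawLogRecord {
    final String userId;
    final String tag;
    final String timestamp;

    RawLogRecord(String userId, String tag, String timestamp) {
        this.userId = userId;
        this.tag = tag;
        this.timestamp = timestamp;
    }
}

/**
 * Hypothetical API-agnostic parser. It has no Hadoop dependency, so the
 * old-API reader (org.apache.hadoop.mapred.RecordReader#next) and the
 * new-API reader (org.apache.hadoop.mapreduce.RecordReader#nextKeyValue)
 * can both call parse() and stay thin.
 */
final class RawLogParser {
    private RawLogParser() {}

    /**
     * Parses one log line.
     * ASSUMPTION: fields are tab-separated as user_id, tag, timestamp.
     * Returns an empty Optional for null or malformed lines.
     */
    static Optional<RawLogRecord> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        // Limit of -1 keeps trailing empty fields instead of dropping them.
        String[] fields = line.split("\t", -1);
        if (fields.length != 3) {
            return Optional.empty();
        }
        return Optional.of(new RawLogRecord(fields[0], fields[1], fields[2]));
    }
}
```

With this split, the mapred.* class you hand to Hive and the mapreduce.* class you use from Pig each only deal with splits and readers, and any change to the log format is made once in the parser.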