On Thu, Apr 15, 2010 at 7:00 PM, Edward Capriolo <[email protected]> wrote:
> On Thu, Apr 15, 2010 at 7:23 PM, Arvind Prabhakar <[email protected]> wrote:
>
>> On Thu, Apr 15, 2010 at 1:23 PM, Edward Capriolo <[email protected]> wrote:
>>
>>> On Thu, Apr 15, 2010 at 3:00 PM, Arvind Prabhakar <[email protected]> wrote:
>>>
>>>> Hi Sagar,
>>>>
>>>> It looks like your source file has custom writable types in it. If that
>>>> is the case, implementing a SerDe that works with that type is doable,
>>>> but may not be straightforward.
>>>>
>>>> An alternative would be to implement a custom RecordReader that converts
>>>> the value of your custom writable to a Struct type, which can then be
>>>> queried directly. (A sketch of this approach appears at the end of this
>>>> thread.)
>>>>
>>>> Arvind
>>>>
>>>> On Thu, Apr 15, 2010 at 1:06 AM, Sagar Naik <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> My data is in the value field of a sequence file, and the value field
>>>>> has subfields in it. I am trying to create a table using these subfields.
>>>>>
>>>>> Example:
>>>>> <KEY> <VALUE>
>>>>> <KEY_FIELD1, KEY_FIELD2> forms the key
>>>>> <VALUE_FIELD1, VALUE_FIELD2, VALUE_FIELD3> forms the value
>>>>>
>>>>> So I am trying to create a table from VALUE_FIELD*:
>>>>>
>>>>> CREATE EXTERNAL TABLE table_name (VALUE_FIELD1 BIGINT, VALUE_FIELD2
>>>>> STRING, VALUE_FIELD3 BIGINT) STORED AS SEQUENCEFILE;
>>>>>
>>>>> I am planning to write a custom SerDe implementation and a custom
>>>>> SequenceFile reader. Please let me know if I am on the right track.
>>>>>
>>>>> -Sagar
>>>>
>>> I am actually having lots of trouble with this. I have a sequence file
>>> that opens fine with:
>>>
>>> /home/edward/hadoop/hadoop-0.20.2/bin/hadoop dfs -text /home/edward/Downloads/seq/seq
>>>
>>> create external table keyonly (ver string, theid int, thedate string)
>>> row format delimited fields terminated by ','
>>> stored as
>>> inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
>>> outputformat 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
>>> location '/home/edward/Downloads/seq';
>>>
>>> I also tried
>>>
>>> inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
>>>
>>> and stored as SEQUENCEFILE. I always get this:
>>>
>>> 2010-04-15 13:10:43,849 ERROR CliDriver (SessionState.java:printError(255))
>>> - Failed with exception java.io.IOException: java.io.EOFException
>>> java.io.IOException: java.io.EOFException
>>>   at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:332)
>>>   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:120)
>>>   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:681)
>>>   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:146)
>>>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197)
>>>   at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:510)
>>>   at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_key_only(TestCliDriver.java:79)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>   at junit.framework.TestCase.runTest(TestCase.java:154)
>>>   at junit.framework.TestCase.runBare(TestCase.java:127)
>>>   at junit.framework.TestResult$1.protect(TestResult.java:106)
>>>   at junit.framework.TestResult.runProtected(TestResult.java:124)
>>>   at junit.framework.TestResult.run(TestResult.java:109)
>>>   at junit.framework.TestCase.run(TestCase.java:118)
>>>   at junit.framework.TestSuite.runTest(TestSuite.java:208)
>>>   at junit.framework.TestSuite.run(TestSuite.java:203)
>>>   at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
>>>   at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
>>>   at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)
>>> Caused by: java.io.EOFException
>>>   at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
>>>   at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
>>>   at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
>>>   at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>>   at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>>   at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>>   at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>>   at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>>   at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>>   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>>   at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>>>   at org.apache.hadoop.mapred.SequenceFileAsTextRecordReader.<init>(SequenceFileAsTextRecordReader.java:44)
>>>   at org.apache.hadoop.mapred.SequenceFileAsTextInputFormat.getRecordReader(SequenceFileAsTextInputFormat.java:43)
>>>   at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:296)
>>>   at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:311)
>>>   ... 21 more
>>>
>>> Does anyone have a clue about what I am doing wrong?
>>>
>> The SequenceFileAsTextInputFormat converts the sequence record values to
>> strings using the toString() invocation. Assuming that your data has a
>> custom writable with multiple fields in it, I don't think it is possible
>> for you to map the individual bits to different columns.
>>
>> Can you try doing the following:
>>
>> create external table dummy (fullvalue string)
>> stored as
>> inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
>> outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>> location '/home/edward/Downloads/seq';
>>
>> and then doing a select * from dummy.
>>
>> Arvind
>
> [edw...@ec hive]$ head -1 /home/edward/Downloads/seq/seq | od -a
> 0000000   S   E   Q ack  em   o   r   g   .   a   p   a   c   h   e   .
> 0000020   h   a   d   o   o   p   .   i   o   .   T   e   x   t  em   o
> 0000040   r   g   .   a   p   a   c   h   e   .   h   a   d   o   o   p
> 0000060   .   i   o   .   T   e   x   t soh soh   '   o   r   g   .   a
> 0000100   p   a   c   h   e   .   h   a   d   o   o   p   .   i   o   .
> 0000120   c   o   m   p   r   e   s   s   .   G   z   i   p   C   o   d
> 0000140   e   c nul nul nul nul   =   4  ff   Y   F   s   V  so   4   "
> 0000160   R   +   X enq dle   T del del del del   =   4  ff   Y   F   s
> 0000200   V  so   4   "   R   +   X enq dle   T soh etb  us  vt  bs nul
>
> 2010-04-15 18:45:24,954 ERROR CliDriver (SessionState.java:printError(255))
> - Failed with exception java.io.IOException: java.io.EOFException
> java.io.IOException: java.io.EOFException
> [same stack trace as above, ending in the GZIPInputStream EOFException]

The compression being used here (gzip) is not suitable for splitting the
input files. That could be the reason why you are seeing this exception.
Can you try using a different compression scheme, such as bzip2, or perhaps
not compressing the files at all?

Arvind
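For reference, here is a minimal sketch of rewriting such a file with a
splittable, block-compressed codec along the lines Arvind suggests. The od
dump above shows that both the key and value classes recorded in the header
are org.apache.hadoop.io.Text, so Text is used for both here; a file holding
custom writables would substitute those classes, and the output path is an
assumption:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;

    public class RewriteSeqFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Read the existing (gzip-compressed) sequence file.
            SequenceFile.Reader reader = new SequenceFile.Reader(
                fs, new Path("/home/edward/Downloads/seq/seq"), conf);
            Text key = new Text();
            Text value = new Text();

            // BLOCK compression compresses batches of records inside the
            // SequenceFile container, keeping the file splittable.
            // The output path is a hypothetical example.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/home/edward/Downloads/seq-bzip2/seq"),
                Text.class, Text.class,
                SequenceFile.CompressionType.BLOCK, new BZip2Codec());

            while (reader.next(key, value)) {
                writer.append(key, value);
            }
            writer.close();
            reader.close();
        }
    }

Once the data is rewritten, the same create external table statement can
simply point its location clause at the new directory.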

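Returning to Arvind's earlier suggestion of a custom RecordReader for the
case where the value really is a multi-field custom writable: the sketch
below flattens the subfields into a comma-delimited Text row that Hive's
default delimited row format can split into columns. MyKeyWritable,
MyValueWritable, and the getField*() accessors are hypothetical stand-ins
for whatever types the file actually contains:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.SequenceFileRecordReader;

    // MyKeyWritable and MyValueWritable are hypothetical placeholders for
    // the custom writable types actually stored in the sequence file.
    public class MyValueRecordReader implements RecordReader<Text, Text> {
        private final SequenceFileRecordReader<MyKeyWritable, MyValueWritable> reader;
        private final MyKeyWritable key;
        private final MyValueWritable value;

        public MyValueRecordReader(JobConf conf, FileSplit split) throws IOException {
            reader = new SequenceFileRecordReader<MyKeyWritable, MyValueWritable>(conf, split);
            key = reader.createKey();
            value = reader.createValue();
        }

        public boolean next(Text k, Text v) throws IOException {
            if (!reader.next(key, value)) {
                return false;
            }
            // Flatten the subfields into a comma-delimited row so that a
            // table declared with "row format delimited fields terminated
            // by ','" can split it into columns. getField1() etc. are
            // hypothetical accessors on the custom writable.
            v.set(value.getField1() + "," + value.getField2() + "," + value.getField3());
            k.set(key.toString());
            return true;
        }

        public Text createKey() { return new Text(); }
        public Text createValue() { return new Text(); }
        public long getPos() throws IOException { return reader.getPos(); }
        public float getProgress() throws IOException { return reader.getProgress(); }
        public void close() throws IOException { reader.close(); }
    }

A small FileInputFormat subclass whose getRecordReader method returns this
reader would then be named in the inputformat clause of the create external
table statement.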