On Thu, Apr 15, 2010 at 7:00 PM, Edward Capriolo <[email protected]> wrote:

>
>
> On Thu, Apr 15, 2010 at 7:23 PM, Arvind Prabhakar <[email protected]> wrote:
>
>> On Thu, Apr 15, 2010 at 1:23 PM, Edward Capriolo 
>> <[email protected]> wrote:
>>
>>>
>>>
>>> On Thu, Apr 15, 2010 at 3:00 PM, Arvind Prabhakar 
>>> <[email protected]> wrote:
>>>
>>>> Hi Sagar,
>>>>
>>>> Looks like your source file has custom writable types in it. If that is
>>>> the case, implementing a SerDe that works with that type may not be that
>>>> straight forward, although doable.
>>>>
>>>> An alternative would be to implement a custom RecordReader that converts
>>>> the value of your custom writable to Struct type which can then be queried
>>>> directly.
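>>>>
>>>> One simpler variation on that idea, sketched below, flattens the value
>>>> into a comma-delimited Text instead of a Struct so that Hive's default
>>>> delimited SerDe can split it. This is only a rough sketch: MyCustomWritable
>>>> and its getField1()/getField2()/getField3() accessors are hypothetical
>>>> stand-ins for your own value class.
>>>>
>>>> import java.io.IOException;
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.io.LongWritable;
>>>> import org.apache.hadoop.io.Text;
>>>> import org.apache.hadoop.io.Writable;
>>>> import org.apache.hadoop.mapred.FileSplit;
>>>> import org.apache.hadoop.mapred.RecordReader;
>>>> import org.apache.hadoop.mapred.SequenceFileRecordReader;
>>>>
>>>> // Wraps the stock SequenceFileRecordReader and turns each custom value
>>>> // writable into a comma-delimited Text row.
>>>> public class CustomValueRecordReader implements RecordReader<LongWritable, Text> {
>>>>
>>>>   // MyCustomWritable is hypothetical - substitute your own value class.
>>>>   private final SequenceFileRecordReader<Writable, MyCustomWritable> reader;
>>>>   private final Writable rawKey;
>>>>   private final MyCustomWritable rawValue;
>>>>
>>>>   public CustomValueRecordReader(Configuration conf, FileSplit split) throws IOException {
>>>>     reader = new SequenceFileRecordReader<Writable, MyCustomWritable>(conf, split);
>>>>     rawKey = reader.createKey();
>>>>     rawValue = reader.createValue();
>>>>   }
>>>>
>>>>   public boolean next(LongWritable key, Text value) throws IOException {
>>>>     if (!reader.next(rawKey, rawValue)) {
>>>>       return false;
>>>>     }
>>>>     key.set(reader.getPos());
>>>>     // Flatten the subfields into one delimited line for the SerDe.
>>>>     value.set(rawValue.getField1() + "," + rawValue.getField2() + "," + rawValue.getField3());
>>>>     return true;
>>>>   }
>>>>
>>>>   public LongWritable createKey() { return new LongWritable(); }
>>>>   public Text createValue() { return new Text(); }
>>>>   public long getPos() throws IOException { return reader.getPos(); }
>>>>   public float getProgress() throws IOException { return reader.getProgress(); }
>>>>   public void close() throws IOException { reader.close(); }
>>>> }
>>>>
>>>> A small InputFormat whose getRecordReader() returns this reader can then
>>>> be named in the table's STORED AS ... INPUTFORMAT clause.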
>>>>
>>>> Arvind
>>>>
>>>>
>>>> On Thu, Apr 15, 2010 at 1:06 AM, Sagar Naik <[email protected]> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> My data is in the value field of a sequence file.
>>>>> The value field has subfields in it. I am trying to create table using
>>>>> these subfields.
>>>>> Example:
>>>>> Each record is <KEY, VALUE>, where
>>>>> <KEY_FIELD1, KEY_FIELD2> forms the key and
>>>>> <VALUE_FIELD1, VALUE_FIELD2, VALUE_FIELD3> forms the value.
>>>>> So I am trying to create a table from the VALUE_FIELD* subfields:
>>>>>
>>>>> CREATE EXTERNAL TABLE table_name (VALUE_FIELD1 BIGINT, VALUE_FIELD2
>>>>> STRING, VALUE_FIELD3 BIGINT) STORED AS SEQUENCEFILE;
>>>>>
>>>>> I am planning to write a custom SerDe implementation and a custom
>>>>> SequenceFile reader.
>>>>> Please let me know if I am on the right track.
>>>>>
>>>>>
>>>>> -Sagar
>>>>
>>>>
>>>>
>>> I am actually having lots of trouble with this.
>>> I have a sequence file that opens fine with
>>> /home/edward/hadoop/hadoop-0.20.2/bin/hadoop dfs -text
>>> /home/edward/Downloads/seq/seq
>>>
>>> create external table keyonly (ver string, theid int, thedate string)
>>> row format delimited fields terminated by ','
>>> stored as
>>> inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
>>> outputformat
>>> 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
>>> location '/home/edward/Downloads/seq';
>>>
>>>
>>>
>>> Also tried
>>> inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
>>> or stored as SEQUENCEFILE
>>>
>>> I always get this...
>>>
>>> 2010-04-15 13:10:43,849 ERROR CliDriver
>>> (SessionState.java:printError(255)) - Failed with exception
>>> java.io.IOException:java.io.EOFException
>>> java.io.IOException: java.io.EOFException
>>>     at
>>> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:332)
>>>     at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:120)
>>>     at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:681)
>>>     at
>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:146)
>>>     at
>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197)
>>>     at
>>> org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:510)
>>>     at
>>> org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_key_only(TestCliDriver.java:79)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>     at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at junit.framework.TestCase.runTest(TestCase.java:154)
>>>     at junit.framework.TestCase.runBare(TestCase.java:127)
>>>     at junit.framework.TestResult$1.protect(TestResult.java:106)
>>>     at junit.framework.TestResult.runProtected(TestResult.java:124)
>>>     at junit.framework.TestResult.run(TestResult.java:109)
>>>     at junit.framework.TestCase.run(TestCase.java:118)
>>>     at junit.framework.TestSuite.runTest(TestSuite.java:208)
>>>     at junit.framework.TestSuite.run(TestSuite.java:203)
>>>     at
>>> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
>>>     at
>>> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
>>>     at
>>> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)
>>> Caused by: java.io.EOFException
>>>     at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
>>>     at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
>>>     at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
>>>     at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>>     at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>>     at
>>> org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>>     at
>>> org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>>     at
>>> org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>>     at
>>> org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>>     at
>>> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
>>>     at
>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>>>     at
>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>>     at
>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>>     at
>>> org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>>>     at
>>> org.apache.hadoop.mapred.SequenceFileAsTextRecordReader.<init>(SequenceFileAsTextRecordReader.java:44)
>>>     at
>>> org.apache.hadoop.mapred.SequenceFileAsTextInputFormat.getRecordReader(SequenceFileAsTextInputFormat.java:43)
>>>     at
>>> org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:296)
>>>     at
>>> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:311)
>>>     ... 21 more
>>>
>>> Does anyone have a clue about what I am doing wrong?
>>>
>>>
>> The SequenceFileAsTextInputFormat converts each sequence record value to a
>> string by calling toString() on it. Assuming your data uses a custom
>> writable with multiple fields in it, I don't think it is possible for you
>> to map the individual fields to different columns.
>>
>> Can you try doing the following:
>>
>> create external table dummy (fullvalue string)
>> stored as
>> inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
>> outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>> location '/home/edward/Downloads/seq';
>>
>> and then doing a select * from dummy.
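>>
>> To see what that toString() conversion produces (i.e. what the single
>> fullvalue column would contain), here is a minimal standalone sketch; the
>> DumpValues class name is just a placeholder, and the sequence file path is
>> passed as the first argument:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.Writable;
>> import org.apache.hadoop.util.ReflectionUtils;
>>
>> public class DumpValues {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = new Configuration();
>>     FileSystem fs = FileSystem.get(conf);
>>     // Open the sequence file directly and print value.toString() per record.
>>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
>>     Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>>     Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
>>     while (reader.next(key, value)) {
>>       // This string is all Hive would see for the fullvalue column.
>>       System.out.println(value.toString());
>>     }
>>     reader.close();
>>   }
>> }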
>>
>> Arvind
>>
>
>
> [edw...@ec hive]$ head -1 /home/edward/Downloads/seq/seq | od -a
> 0000000   S   E   Q ack  em   o   r   g   .   a   p   a   c   h   e   .
> 0000020   h   a   d   o   o   p   .   i   o   .   T   e   x   t  em   o
> 0000040   r   g   .   a   p   a   c   h   e   .   h   a   d   o   o   p
> 0000060   .   i   o   .   T   e   x   t soh soh   '   o   r   g   .   a
> 0000100   p   a   c   h   e   .   h   a   d   o   o   p   .   i   o   .
> 0000120   c   o   m   p   r   e   s   s   .   G   z   i   p   C   o   d
> 0000140   e   c nul nul nul nul   =   4  ff   Y   F   s   V  so   4   "
> 0000160   R   +   X enq dle   T del del del del   =   4  ff   Y   F   s
> 0000200   V  so   4   "   R   +   X enq dle   T soh etb  us  vt  bs nul
>
>
> 2010-04-15 18:45:24,954 ERROR CliDriver (SessionState.java:printError(255))
> - Failed with exception java.io.IOException:java.io.EOFException
>
> java.io.IOException: java.io.EOFException
>     at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:332)
>     at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:120)
>     at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:681)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:146)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197)
>     at
> org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:510)
>     at
> org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_key_only(TestCliDriver.java:79)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at junit.framework.TestCase.runTest(TestCase.java:154)
>     at junit.framework.TestCase.runBare(TestCase.java:127)
>     at junit.framework.TestResult$1.protect(TestResult.java:106)
>     at junit.framework.TestResult.runProtected(TestResult.java:124)
>     at junit.framework.TestResult.run(TestResult.java:109)
>     at junit.framework.TestCase.run(TestCase.java:118)
>     at junit.framework.TestSuite.runTest(TestSuite.java:208)
>     at junit.framework.TestSuite.run(TestSuite.java:203)
>     at
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
>     at
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
>     at
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)
> Caused by: java.io.EOFException
>     at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
>     at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
>     at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
>     at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>     at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>     at
> org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>     at
> org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>     at
> org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>     at
> org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>     at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
>     at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>     at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>     at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>     at
> org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>     at
> org.apache.hadoop.mapred.SequenceFileAsTextRecordReader.<init>(SequenceFileAsTextRecordReader.java:44)
>     at
> org.apache.hadoop.mapred.SequenceFileAsTextInputFormat.getRecordReader(SequenceFileAsTextInputFormat.java:43)
>     at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:296)
>     at
> org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:311)
>     ... 21 more
>
>
The compression being used here - gzip - is not splittable, so the input files
cannot be split. That could be the reason you are seeing this exception. Can
you try a different compression scheme such as bzip2, or perhaps not
compressing the files at all?
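
For illustration only, here is a minimal sketch of writing a SequenceFile with
bzip2 block compression instead of gzip (pass CompressionType.NONE to skip
compression entirely). The output path and the single appended record are
placeholders; the Text key/value classes match what the od header above shows:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;

public class ReWriteSeq {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Create a writer that block-compresses records with bzip2 instead of gzip.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/home/edward/Downloads/seq/seq.bz2"),
        Text.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new BZip2Codec());
    writer.append(new Text("key"), new Text("value"));
    writer.close();
  }
}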

Arvind
