On Thu, Apr 15, 2010 at 7:23 PM, Arvind Prabhakar <[email protected]> wrote:
> On Thu, Apr 15, 2010 at 1:23 PM, Edward Capriolo <[email protected]> wrote:
>
>> On Thu, Apr 15, 2010 at 3:00 PM, Arvind Prabhakar <[email protected]> wrote:
>>
>>> Hi Sagar,
>>>
>>> It looks like your source file has custom writable types in it. If that
>>> is the case, implementing a SerDe that works with that type may not be
>>> straightforward, although it is doable.
>>>
>>> An alternative would be to implement a custom RecordReader that converts
>>> the value of your custom writable to a Struct type, which can then be
>>> queried directly.
>>>
>>> Arvind
>>>
>>> On Thu, Apr 15, 2010 at 1:06 AM, Sagar Naik <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> My data is in the value field of a sequence file. The value field has
>>>> subfields in it, and I am trying to create a table using those
>>>> subfields. For example:
>>>>
>>>>   <KEY> <VALUE>
>>>>   <KEY_FIELD1, KEY_FIELD2> forms the key
>>>>   <VALUE_FIELD1, VALUE_FIELD2, VALUE_FIELD3> forms the value
>>>>
>>>> So I am trying to create a table from VALUE_FIELD*:
>>>>
>>>>   CREATE EXTERNAL TABLE table_name (VALUE_FIELD1 BIGINT,
>>>>   VALUE_FIELD2 STRING, VALUE_FIELD3 BIGINT) STORED AS SEQUENCEFILE;
>>>>
>>>> I am planning to write a custom SerDe implementation and a custom
>>>> SequenceFileReader. Please let me know if I am on the right track.
>>>>
>>>> -Sagar
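
A minimal sketch of the RecordReader approach Arvind describes above. The
MyKeyWritable/MyValueWritable classes and their getField*() accessors are
hypothetical stand-ins for the actual custom types, which are not shown in
this thread; and rather than producing a full Struct, this variant flattens
the value subfields into ^A-delimited Text so Hive's default LazySimpleSerDe
can map them onto columns:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileRecordReader;

// Sketch only: MyKeyWritable/MyValueWritable and their getField*()
// accessors stand in for the actual custom writables in the file.
public class MyValueInputFormat extends FileInputFormat<Text, Text> {

  public RecordReader<Text, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    return new MyValueRecordReader(job, (FileSplit) split);
  }

  static class MyValueRecordReader implements RecordReader<Text, Text> {
    private final SequenceFileRecordReader<MyKeyWritable, MyValueWritable> reader;
    private final MyKeyWritable key = new MyKeyWritable();
    private final MyValueWritable value = new MyValueWritable();

    MyValueRecordReader(JobConf job, FileSplit split) throws IOException {
      reader = new SequenceFileRecordReader<MyKeyWritable, MyValueWritable>(job, split);
    }

    public boolean next(Text k, Text v) throws IOException {
      if (!reader.next(key, value)) {
        return false;
      }
      // Flatten the value subfields into ^A-delimited text so the
      // default LazySimpleSerDe can split them back into columns.
      k.set(key.toString());
      v.set(value.getField1() + "\001" + value.getField2() + "\001"
          + value.getField3());
      return true;
    }

    public Text createKey() { return new Text(); }
    public Text createValue() { return new Text(); }
    public long getPos() throws IOException { return reader.getPos(); }
    public float getProgress() throws IOException { return reader.getProgress(); }
    public void close() throws IOException { reader.close(); }
  }
}

The CREATE EXTERNAL TABLE would then name this class in its INPUTFORMAT
clause and declare the three columns normally; since ^A is Hive's default
field delimiter, no explicit ROW FORMAT clause is strictly needed.
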
>> I am actually having lots of trouble with this. I have a sequence file
>> that opens fine with:
>>
>>   /home/edward/hadoop/hadoop-0.20.2/bin/hadoop dfs -text /home/edward/Downloads/seq/seq
>>
>>   create external table keyonly (ver string, theid int, thedate string)
>>   row format delimited fields terminated by ','
>>   stored as
>>     inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
>>     outputformat 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
>>   location '/home/edward/Downloads/seq';
>>
>> I also tried
>>
>>   inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
>>
>> or stored as SEQUENCEFILE. I always get this:
>>
>> 2010-04-15 13:10:43,849 ERROR CliDriver (SessionState.java:printError(255))
>> - Failed with exception java.io.IOException:java.io.EOFException
>> java.io.IOException: java.io.EOFException
>>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:332)
>>     at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:120)
>>     at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:681)
>>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:146)
>>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197)
>>     at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:510)
>>     at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_key_only(TestCliDriver.java:79)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at junit.framework.TestCase.runTest(TestCase.java:154)
>>     at junit.framework.TestCase.runBare(TestCase.java:127)
>>     at junit.framework.TestResult$1.protect(TestResult.java:106)
>>     at junit.framework.TestResult.runProtected(TestResult.java:124)
>>     at junit.framework.TestResult.run(TestResult.java:109)
>>     at junit.framework.TestCase.run(TestCase.java:118)
>>     at junit.framework.TestSuite.runTest(TestSuite.java:208)
>>     at junit.framework.TestSuite.run(TestSuite.java:203)
>>     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
>>     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
>>     at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)
>> Caused by: java.io.EOFException
>>     at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
>>     at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
>>     at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
>>     at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>     at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>     at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>     at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>>     at org.apache.hadoop.mapred.SequenceFileAsTextRecordReader.<init>(SequenceFileAsTextRecordReader.java:44)
>>     at org.apache.hadoop.mapred.SequenceFileAsTextInputFormat.getRecordReader(SequenceFileAsTextInputFormat.java:43)
>>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:296)
>>     at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:311)
>>     ... 21 more
>>
>> Does anyone have a clue on what I am doing wrong?
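
The trace shows the EOFException being thrown out of GzipCodec while Hive's
FetchOperator is still initializing the SequenceFile.Reader, before any
records are read. One way to compare against what hadoop dfs -text does is
to open the file with SequenceFile.Reader directly and print its header; a
minimal sketch (the SeqDump class name is made up, and the path comes in as
an argument):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Opens a sequence file outside Hive and prints its header and records.
public class SeqDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]); // e.g. /home/edward/Downloads/seq/seq
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    System.out.println("key class:   " + reader.getKeyClassName());
    System.out.println("value class: " + reader.getValueClassName());
    System.out.println("compressed:  " + reader.isCompressed()
        + " (codec: " + (reader.getCompressionCodec() == null
            ? "none" : reader.getCompressionCodec().getClass().getName()) + ")");
    Writable key = (Writable) ReflectionUtils.newInstance(
        reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(
        reader.getValueClass(), conf);
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}
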
> The SequenceFileAsTextInputFormat converts the sequence record values to
> string using the toString() invocation. Assuming that your data has a
> custom writable with multiple fields in it, I don't think it is possible
> for you to map the individual bits to different columns.
>
> Can you try doing the following:
>
>   create external table dummy (fullvalue string)
>   stored as
>     inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
>     outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>   location '/home/edward/Downloads/seq';
>
> and then doing a select * from dummy.
>
> Arvind
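
To make the toString() point concrete: whatever the custom writable's
toString() returns is exactly what SequenceFileAsTextInputFormat hands Hive
for the fullvalue column. A toy stand-in for such a writable
(DemoValueWritable is made up, not the type in the file):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Toy writable with three subfields, standing in for the real custom type.
public class DemoValueWritable implements Writable {
  private long field1;
  private String field2;
  private long field3;

  public void write(DataOutput out) throws IOException {
    out.writeLong(field1);
    out.writeUTF(field2);
    out.writeLong(field3);
  }

  public void readFields(DataInput in) throws IOException {
    field1 = in.readLong();
    field2 = in.readUTF();
    field3 = in.readLong();
  }

  // SequenceFileAsTextInputFormat gives Hive exactly this string; a
  // ^A-delimited toString() would at least let LazySimpleSerDe split
  // it back into columns.
  @Override
  public String toString() {
    return field1 + "\001" + field2 + "\001" + field3;
  }
}

If the real type's toString() happens to emit a stable delimiter, the fields
could be split back out with ROW FORMAT DELIMITED; otherwise a custom
RecordReader like the earlier sketch is needed to recover the subfields.
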
[edw...@ec hive]$ head -1 /home/edward/Downloads/seq/seq | od -a
0000000   S   E   Q ack  em   o   r   g   .   a   p   a   c   h   e   .
0000020   h   a   d   o   o   p   .   i   o   .   T   e   x   t  em   o
0000040   r   g   .   a   p   a   c   h   e   .   h   a   d   o   o   p
0000060   .   i   o   .   T   e   x   t soh soh   '   o   r   g   .   a
0000100   p   a   c   h   e   .   h   a   d   o   o   p   .   i   o   .
0000120   c   o   m   p   r   e   s   s   .   G   z   i   p   C   o   d
0000140   e   c nul nul nul nul   =   4  ff   Y   F   s   V  so   4   "
0000160   R   +   X enq dle   T del del del del   =   4  ff   Y   F   s
0000200   V  so   4   "   R   +   X enq dle   T soh etb  us  vt  bs nul

2010-04-15 18:45:24,954 ERROR CliDriver (SessionState.java:printError(255))
- Failed with exception java.io.IOException:java.io.EOFException
java.io.IOException: java.io.EOFException
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:332)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:120)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:681)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:146)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:197)
    at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:510)
    at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_key_only(TestCliDriver.java:79)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
    at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)
Caused by: java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
    at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
    at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
    at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
    at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
    at org.apache.hadoop.mapred.SequenceFileAsTextRecordReader.<init>(SequenceFileAsTextRecordReader.java:44)
    at org.apache.hadoop.mapred.SequenceFileAsTextInputFormat.getRecordReader(SequenceFileAsTextInputFormat.java:43)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:296)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:311)
    ... 21 more
