I'm having a problem getting the SequenceFileLoader, from the Piggybank, to
read sequence files whose values are block comressed (gzip'd). I'm using Pig
0.4.99.0+10, and Hadoop hadoop-0.20.1+152, via Cloudera.

Did the following:

* Copied the SequenceFileLoader class into my own project

* Removed

public LoadFunc.RequiredFieldResponse
fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList)

because LoadFunc.RequiredFieldList isn't resolvable, and added

public void fieldsToRead(Schema schema)

* Jarred up the .class file

* Programmatically created a trivial sequence file of a few lines, with
IntWritable keys and Text values, using the basic code in an example in
Hadoop The Definitive Guide

* That file is successfully read and keys/values displayed, with "hadoop fs
-text", as well as with pig, doing the following:

grunt> register sequencefileloader.jar;
grunt> r = load '/path/to/sequence_file' using
com.foobar.SequenceFileLoader();
grunt> dump r;

* The sequence file with the compressed values is successfully read with
hadoop fs -text

* When doing the load step in pig with that file, the following results:

--
2010-02-19 16:59:14,489 [main] WARN  org.apache.hadoop.util.NativeCodeLoader
- Unable to load native-hadoop library for your platform..
. using builtin-java classes where applicable
2010-02-19 16:59:14,490 [main] INFO  org.apache.hadoop.io.compress.CodecPool
- Got brand-new decompressor
2010-02-19 16:59:14,498 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1018: Problem determining schema during load
Details at logfile: /path/to/pig_1266616744562.log
--

That log file contains the following:

--
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during
parsing. Problem determining schema during load
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1037)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:981)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
        at
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:717)
        at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:273)
        at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
        at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:363)
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem
determining schema during load
        at
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734)
        at
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1031)
        ... 8 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018:
Problem determining schema during load
        at
org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155)
        at
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732)
        ... 10 more
Caused by: java.io.EOFException
        at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
        at
java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
        at
java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
        at
org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
        at
org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
        at
org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
        at
org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
        at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at
com.media6.SequenceFileLoader.inferReader(SequenceFileLoader.java:140)
        at
com.media6.SequenceFileLoader.determineSchema(SequenceFileLoader.java:106)
        at
org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148)
        ... 11 more
--

Maybe there's something that needs to be added to SequenceFileLoader to
account for the compressed values, which hadoop's "fs -text" accounts for.
Thanks for any ideas/pointers.

Derek

Reply via email to