Derek, please open a ticket on the Jira, I'll check it out. It's probably some trickiness with file bytes vs bytes read. I never tested with compressed input files.
-D On Fri, Feb 19, 2010 at 2:45 PM, Derek Brown <[email protected]>wrote: > I'm having a problem getting the SequenceFileLoader, from the Piggybank, to > read sequence files whose values are block comressed (gzip'd). I'm using > Pig > 0.4.99.0+10, and Hadoop hadoop-0.20.1+152, via Cloudera. > > Did the following: > > * Copied the SequenceFileLoader class into my own project > > * Removed > > public LoadFunc.RequiredFieldResponse > fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList) > > because LoadFunc.RequiredFieldList isn't resolvable, and added > > public void fieldsToRead(Schema schema) > > * Jarred up the .class file > > * Programmatically created a trivial sequence file of a few lines, with > IntWritable keys and Text values, using the basic code in an example in > Hadoop The Definitive Guide > > * That file is successfully read and keys/values displayed, with "hadoop fs > -text", as well as with pig, doing the following: > > grunt> register sequencefileloader.jar; > grunt> r = load '/path/to/sequence_file' using > com.foobar.SequenceFileLoader(); > grunt> dump r; > > * The sequence file with the compressed values is successfully read with > hadoop fs -text > > * When doing the load step in pig with that file, the following results: > > -- > 2010-02-19 16:59:14,489 [main] WARN > org.apache.hadoop.util.NativeCodeLoader > - Unable to load native-hadoop library for your platform.. > . using builtin-java classes where applicable > 2010-02-19 16:59:14,490 [main] INFO > org.apache.hadoop.io.compress.CodecPool > - Got brand-new decompressor > 2010-02-19 16:59:14,498 [main] ERROR org.apache.pig.tools.grunt.Grunt - > ERROR 1018: Problem determining schema during load > Details at logfile: /path/to/pig_1266616744562.log > -- > > That log file contains the following: > > -- > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error > during > parsing. Problem determining schema during load > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1037) > at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:981) > at org.apache.pig.PigServer.registerQuery(PigServer.java:383) > at > org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:717) > at > > org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:273) > at > > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166) > at > > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142) > at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) > at org.apache.pig.Main.main(Main.java:363) > Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem > determining schema during load > at > > org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734) > at > > org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63) > at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1031) > ... 8 more > Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: > Problem determining schema during load > at > org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155) > at > > org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732) > ... 10 more > Caused by: java.io.EOFException > at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207) > at > java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197) > at > java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136) > at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58) > at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68) > at > > org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92) > at > > org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101) > at > > org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169) > at > > org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179) > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412) > at > com.media6.SequenceFileLoader.inferReader(SequenceFileLoader.java:140) > at > com.media6.SequenceFileLoader.determineSchema(SequenceFileLoader.java:106) > at > org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148) > ... 11 more > -- > > Maybe there's something that needs to be added to SequenceFileLoader to > account for the compressed values, which hadoop's "fs -text" accounts for. > Thanks for any ideas/pointers. > > Derek >
