Thanks, Dmitriy. I opened the issue on Friday: https://issues.apache.org/jira/browse/PIG-1246
Derek

>From: Dmitriy Ryaboy <[email protected]>
>To: [email protected]
>Date: Fri, 19 Feb 2010 14:51:16 -0800
>Subject: Re: SequenceFileLoader problem with compressed values
>Derek, please open a ticket on the Jira, I'll check it out. It's probably
>some trickiness with file bytes vs bytes read. I never tested with
>compressed input files.
>
>-D
>
>
>On Fri, Feb 19, 2010 at 2:45 PM, Derek Brown <[email protected]>
>wrote:
>
>> I'm having a problem getting the SequenceFileLoader, from the Piggybank, to
>> read sequence files whose values are block compressed (gzip'd). I'm using
>> Pig 0.4.99.0+10, and Hadoop hadoop-0.20.1+152, via Cloudera.
>>
>> Did the following:
>>
>> * Copied the SequenceFileLoader class into my own project
>>
>> * Removed
>>
>> public LoadFunc.RequiredFieldResponse fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList)
>>
>> because LoadFunc.RequiredFieldList isn't resolvable, and added
>>
>> public void fieldsToRead(Schema schema)
>>
>> * Jarred up the .class file
>>
>> * Programmatically created a trivial sequence file of a few lines, with
>> IntWritable keys and Text values, using the basic code in an example in
>> Hadoop: The Definitive Guide
>>
>> * That file is successfully read and keys/values displayed, with "hadoop fs
>> -text", as well as with pig, doing the following:
>>
>> grunt> register sequencefileloader.jar;
>> grunt> r = load '/path/to/sequence_file' using com.foobar.SequenceFileLoader();
>> grunt> dump r;
>>
>> * The sequence file with the compressed values is successfully read with
>> hadoop fs -text
>>
>> * When doing the load step in pig with that file, the following results:
>>
>> --
>> 2010-02-19 16:59:14,489 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2010-02-19 16:59:14,490 [main] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
>> 2010-02-19 16:59:14,498 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1018: Problem determining schema during load
>> Details at logfile: /path/to/pig_1266616744562.log
>> --
>>
>> That log file contains the following:
>>
>> --
>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Problem determining schema during load
>>     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1037)
>>     at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:981)
>>     at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
>>     at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:717)
>>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:273)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
>>     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
>>     at org.apache.pig.Main.main(Main.java:363)
>> Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem determining schema during load
>>     at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734)
>>     at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
>>     at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1031)
>>     ... 8 more
>> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: Problem determining schema during load
>>     at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155)
>>     at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732)
>>     ... 10 more
>> Caused by: java.io.EOFException
>>     at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
>>     at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
>>     at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
>>     at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>     at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>     at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>     at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>     at com.media6.SequenceFileLoader.inferReader(SequenceFileLoader.java:140)
>>     at com.media6.SequenceFileLoader.determineSchema(SequenceFileLoader.java:106)
>>     at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148)
>>     ... 11 more
>> --
>>
>> Maybe there's something that needs to be added to SequenceFileLoader to
>> account for the compressed values, which hadoop's "fs -text" accounts for.
>> Thanks for any ideas/pointers.
>>
>> Derek
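
For anyone reading the innermost frames of the trace above: the java.util.zip.GZIPInputStream constructor reads the gzip header eagerly, so it throws java.io.EOFException as soon as it is given a stream with no bytes available. The sketch below reproduces only that JDK behaviour in isolation. The class and method names (GzipHeaderProbe, openGzip, gzip) are made up for illustration, and it does not show why SequenceFile$Reader.init ends up handing the codec such a stream in this setup; that part is left to PIG-1246.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of Pig, Piggybank, or Hadoop.
public class GzipHeaderProbe {

    // Tries to construct a GZIPInputStream over the given bytes.
    // Returns "ok" if the constructor succeeds, otherwise the name of
    // the IOException subclass thrown while reading the gzip header.
    public static String openGzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return "ok";
        } catch (IOException e) {
            return e.getClass().getName();
        }
    }

    // Compresses raw bytes into a complete in-memory gzip stream.
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Empty input: the header read in the constructor hits end of stream.
        System.out.println(openGzip(new byte[0]));
        // Well-formed gzip data: the constructor succeeds.
        System.out.println(openGzip(gzip("hello".getBytes(StandardCharsets.UTF_8))));
    }
}
```

Run as-is, the first line printed is java.io.EOFException and the second is ok, which matches the readHeader/readUShort/readUByte frames at the bottom of the trace.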
