Re: SequenceFileLoader problem with compressed values

Derek Brown Mon, 22 Feb 2010 07:09:08 -0800

Thanks, Dmitriy. I opened the issue on Friday:
https://issues.apache.org/jira/browse/PIG-1246


Derek


>From: Dmitriy Ryaboy <[email protected]>
>To: [email protected]
>Date: Fri, 19 Feb 2010 14:51:16 -0800
>Subject: Re: SequenceFileLoader problem with compressed values
>Derek, please open a ticket on the Jira, I'll check it out. It's probably
>some trickiness with file bytes vs bytes read. I never tested with
>compressed input files.
>
>-D
>
>
>On Fri, Feb 19, 2010 at 2:45 PM, Derek Brown <[email protected]>
>wrote:
>
>> I'm having a problem getting the SequenceFileLoader, from the Piggybank,
to
>> read sequence files whose values are block comressed (gzip'd). I'm using
>> Pig
>> 0.4.99.0+10, and Hadoop hadoop-0.20.1+152, via Cloudera.
>>
>> Did the following:
>>
>> * Copied the SequenceFileLoader class into my own project
>>
>> * Removed
>>
>> public LoadFunc.RequiredFieldResponse
>> fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList)
>>
>> because LoadFunc.RequiredFieldList isn't resolvable, and added
>>
>> public void fieldsToRead(Schema schema)
>>
>> * Jarred up the .class file
>>
>> * Programmatically created a trivial sequence file of a few lines, with
>> IntWritable keys and Text values, using the basic code in an example in
>> Hadoop The Definitive Guide
>>
>> * That file is successfully read and keys/values displayed, with "hadoop
fs
>> -text", as well as with pig, doing the following:
>>
>> grunt> register sequencefileloader.jar;
>> grunt> r = load '/path/to/sequence_file' using
>> com.foobar.SequenceFileLoader();
>> grunt> dump r;
>>
>> * The sequence file with the compressed values is successfully read with
>> hadoop fs -text
>>
>> * When doing the load step in pig with that file, the following results:
>>
>> --
>> 2010-02-19 16:59:14,489 [main] WARN
>>  org.apache.hadoop.util.NativeCodeLoader
>> - Unable to load native-hadoop library for your platform..
>> . using builtin-java classes where applicable
>> 2010-02-19 16:59:14,490 [main] INFO
>>  org.apache.hadoop.io.compress.CodecPool
>> - Got brand-new decompressor
>> 2010-02-19 16:59:14,498 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 1018: Problem determining schema during load
>> Details at logfile: /path/to/pig_1266616744562.log
>> --
>>
>> That log file contains the following:
>>
>> --
>> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
>> during
>> parsing. Problem determining schema during load
>>        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1037)
>>        at
org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:981)
>>        at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
>>        at
>> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:717)
>>        at
>>
>>
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:273)
>>        at
>>
>>
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
>>        at
>>
>>
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
>>        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
>>        at org.apache.pig.Main.main(Main.java:363)
>> Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException:
Problem
>> determining schema during load
>>        at
>>
>>
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734)
>>        at
>>
>>
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
>>        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1031)
>>        ... 8 more
>> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR
1018:
>> Problem determining schema during load
>>        at
>> org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155)
>>        at
>>
>>
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732)
>>        ... 10 more
>> Caused by: java.io.EOFException
>>        at
java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
>>        at
>> java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
>>        at
>> java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:58)
>>        at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
>>        at
>>
>>
org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:92)
>>        at
>>
>>
org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.<init>(GzipCodec.java:101)
>>        at
>>
>>
org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
>>        at
>>
>>
org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
>>        at
>> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
>>        at
>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>>        at
>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>>        at
>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>>        at
>> com.media6.SequenceFileLoader.inferReader(SequenceFileLoader.java:140)
>>        at
>>
com.media6.SequenceFileLoader.determineSchema(SequenceFileLoader.java:106)
>>        at
>> org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148)
>>        ... 11 more
>> --
>>
>> Maybe there's something that needs to be added to SequenceFileLoader to
>> account for the compressed values, which hadoop's "fs -text" accounts
for.
>> Thanks for any ideas/pointers.
>>
>> Derek

Re: SequenceFileLoader problem with compressed values

Reply via email to