Reading compressed files in local mode + MiniMRCluster
------------------------------------------------------

                 Key: PIG-175
                 URL: https://issues.apache.org/jira/browse/PIG-175
             Project: Pig
          Issue Type: Bug
            Reporter: Craig Macdonald
         Attachments: testCompressed.sh

I have written a small test script that checks whether three simple files (one 
uncompressed, one gzip-compressed and one bzip2-compressed) can be loaded 
successfully. Essentially, it writes a file, compresses it with gzip and bzip2, 
and checks whether Pig can load each version. I use both local execution mode 
and MiniMRCluster.
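
For readers without the attachment, the file-preparation step can be sketched 
roughly as follows. This is not the attached testCompressed.sh; it is a Python 
approximation that assumes the file names test.normal, test.gz and test.bz2 
seen in the log below and the three-line content A, B, C. The helper name 
write_test_files is hypothetical.

```python
import bz2
import gzip
import tempfile
from pathlib import Path


def write_test_files(directory):
    """Write test.normal, test.gz and test.bz2, each holding the lines A, B, C.

    Hypothetical helper: the file names mirror those in the log output,
    but this is not the attached shell script.
    """
    payload = b"A\nB\nC\n"
    base = Path(directory)
    paths = {
        "normal": base / "test.normal",
        "gz": base / "test.gz",
        "bz2": base / "test.bz2",
    }
    paths["normal"].write_bytes(payload)
    paths["gz"].write_bytes(gzip.compress(payload))
    paths["bz2"].write_bytes(bz2.compress(payload))
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for kind, path in write_test_files(tmp).items():
            # Show the leading bytes so the compressed variants are easy to spot.
            print(kind, path.read_bytes()[:3])
```

Each of the three files would then be loaded by its own Pig script 
(test.normal.pig, test.gz.pig, test.bz2.pig) and by a combined test.all.pig.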

Here are my results:
MiniMRCluster
 * uncompressed: OK
 * gzip: OK
 * bzip2: OK
 * All three at once: not OK

Local Execution Mode
 * uncompressed: OK
 * gzip: not OK (garbled output)
 * bzip2: not OK (garbled output)
 * All three at once: not OK (expected)

I'm not sure what the problem is with MiniMRCluster: there is an NPE in 
PigSplit.getLocations(). I suspect that getFileCacheHints() is returning null, 
which usually indicates a nonexistent file. 

However, for the local execution mode, I'm fairly confident that this mode has 
no support for compressed files.
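
The garbled local-mode output in the log below is consistent with this: the 
bzip2 line begins with "BZh91AY&SY", which is the header of a bzip2 stream, so 
the compressed bytes appear to be passed through undecoded. As a rough 
illustration of what a decompressing reader has to do, here is a Python sketch 
that picks a decompressor from the leading magic bytes. It is not Pig code, 
the function name read_lines is hypothetical, and detecting by magic bytes 
(rather than by file extension) is an assumption made for the example.

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of a gzip stream
BZIP2_MAGIC = b"BZh"       # first three bytes of a bzip2 stream


def read_lines(raw):
    """Return the text lines in raw, decompressing gzip or bzip2 data first.

    Hypothetical helper for illustration only; data with neither magic
    prefix is treated as plain text.
    """
    if raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    elif raw.startswith(BZIP2_MAGIC):
        raw = bz2.decompress(raw)
    return raw.decode("utf-8").splitlines()


payload = b"A\nB\nC\n"
for blob in (payload, gzip.compress(payload), bz2.compress(payload)):
    print(read_lines(blob))
```

All three inputs yield the same lines A, B and C, which is the result the 
local execution mode would be expected to produce.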

Craig

{noformat}
==========================================
Bash's good friend: cat
==========================================
Normal
A
B
C
bz2
A
B
C
gzip
A
B
C
==========================================
MiniMRCluster
==========================================
test.all.pig
2008-03-29 12:07:22,103 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
2008-03-29 12:07:22,241 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - 
Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-03-29 12:07:22,555 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job 
-----
2008-03-29 12:07:22,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: 
[/users/grad/craigm/src/pig/FROMApache/trunk4/trunk/test.normal:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:22,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:22,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:22,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:22,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:22,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: 
/tmp/temp-1403805719/tmp1733057091:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:22,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:22,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:22,557 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: 
-1
2008-03-29 12:07:23,427 [Thread-0] INFO  org.apache.hadoop.mapred.MapTask - 
numReduceTasks: 1
2008-03-29 12:07:23,544 [Thread-0] INFO  
org.apache.hadoop.mapred.LocalJobRunner -
2008-03-29 12:07:23,545 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Task 'map_0000' done.
2008-03-29 12:07:23,581 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Saved output of task 'map_0000' to file:/tmp/temp-1403805719/tmp1733057091
2008-03-29 12:07:23,625 [Thread-0] INFO  
org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2008-03-29 12:07:23,626 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Task 'reduce_cibps7' done.
2008-03-29 12:07:23,630 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Saved output of task 'reduce_cibps7' to file:/tmp/temp-1403805719/tmp1733057091
2008-03-29 12:07:24,383 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - 
Pig progress = 100%
(A)
(B)
(C)
2008-03-29 12:07:24,415 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job 
-----
2008-03-29 12:07:24,415 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: 
[/user/craigm/test.gz:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:24,416 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:24,416 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:24,416 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:24,416 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:24,416 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: 
/tmp/temp-1403805719/tmp-1191951534:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:24,416 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:24,416 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:24,417 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: 
-1
java.lang.NullPointerException
        at 
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigSplit.getLocations(PigSplit.java:107)
        at 
org.apache.hadoop.mapred.JobClient.writeSplitsFile(JobClient.java:638)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
        at 
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher.launchPig(MapReduceLauncher.java:260)
        at 
org.apache.pig.backend.hadoop.executionengine.POMapreduce.open(POMapreduce.java:176)
        at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:274)
        at org.apache.pig.PigServer.openIterator(PigServer.java:314)
        at 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:255)
        at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:160)
        at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:63)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
        at org.apache.pig.Main.main(Main.java:265)
2008-03-29 12:07:24,868 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
java.io.IOException: Unable to open iterator for alias: gz
        at 
org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:16)
        at org.apache.pig.PigServer.openIterator(PigServer.java:325)
        at 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:255)
        at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:160)
        at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:63)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
        at org.apache.pig.Main.main(Main.java:265)
Caused by: org.apache.pig.backend.executionengine.ExecException: 
java.io.IOException
        at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:288)
        at org.apache.pig.PigServer.openIterator(PigServer.java:314)
        ... 5 more
Caused by: java.io.IOException
        at 
org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:16)
        at 
org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:12)
        at 
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher.launchPig(MapReduceLauncher.java:380)
        at 
org.apache.pig.backend.hadoop.executionengine.POMapreduce.open(POMapreduce.java:176)
        at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:274)
        ... 6 more
Caused by: java.lang.NullPointerException
        at 
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigSplit.getLocations(PigSplit.java:107)
        at 
org.apache.hadoop.mapred.JobClient.writeSplitsFile(JobClient.java:638)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
        at 
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher.launchPig(MapReduceLauncher.java:260)
        ... 8 more

2008-03-29 12:07:24,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - Unable 
to open iterator for alias: gz
test.bz2.pig
2008-03-29 12:07:25,349 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
2008-03-29 12:07:25,486 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - 
Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-03-29 12:07:25,761 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job 
-----
2008-03-29 12:07:25,761 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: 
[/users/grad/craigm/src/pig/FROMApache/trunk4/trunk/test.bz2:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:25,761 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:25,762 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:25,762 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:25,762 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:25,762 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: 
/tmp/temp-142293823/tmp-1682881533:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:25,762 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:25,762 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:25,762 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: 
-1
2008-03-29 12:07:26,585 [Thread-0] INFO  org.apache.hadoop.mapred.MapTask - 
numReduceTasks: 1
2008-03-29 12:07:26,802 [Thread-0] INFO  
org.apache.hadoop.mapred.LocalJobRunner -
2008-03-29 12:07:26,802 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Task 'map_0000' done.
2008-03-29 12:07:26,809 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Saved output of task 'map_0000' to file:/tmp/temp-142293823/tmp-1682881533
2008-03-29 12:07:26,852 [Thread-0] INFO  
org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2008-03-29 12:07:26,852 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Task 'reduce_r75h48' done.
2008-03-29 12:07:26,859 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Saved output of task 'reduce_r75h48' to file:/tmp/temp-142293823/tmp-1682881533
2008-03-29 12:07:27,547 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - 
Pig progress = 100%
(A)
(B)
(C)
test.gz.pig
2008-03-29 12:07:28,110 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
2008-03-29 12:07:28,266 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - 
Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-03-29 12:07:28,582 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job 
-----
2008-03-29 12:07:28,583 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: 
[/users/grad/craigm/src/pig/FROMApache/trunk4/trunk/test.gz:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:28,583 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:28,583 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:28,583 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:28,584 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:28,584 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: 
/tmp/temp-1552662535/tmp1393315176:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:28,584 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:28,584 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:28,584 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: 
-1
2008-03-29 12:07:29,621 [Thread-0] INFO  org.apache.hadoop.mapred.MapTask - 
numReduceTasks: 1
2008-03-29 12:07:29,677 [Thread-0] WARN  
org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
2008-03-29 12:07:29,830 [Thread-0] INFO  
org.apache.hadoop.mapred.LocalJobRunner -
2008-03-29 12:07:29,831 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Task 'map_0000' done.
2008-03-29 12:07:29,875 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Saved output of task 'map_0000' to file:/tmp/temp-1552662535/tmp1393315176
2008-03-29 12:07:30,096 [Thread-0] INFO  
org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2008-03-29 12:07:30,097 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Task 'reduce_kan4fo' done.
2008-03-29 12:07:30,103 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Saved output of task 'reduce_kan4fo' to file:/tmp/temp-1552662535/tmp1393315176
2008-03-29 12:07:30,583 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - 
Pig progress = 100%
(A)
(B)
(C)
test.normal.pig
2008-03-29 12:07:31,114 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
2008-03-29 12:07:31,270 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - 
Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-03-29 12:07:31,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job 
-----
2008-03-29 12:07:31,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: 
[/users/grad/craigm/src/pig/FROMApache/trunk4/trunk/test.normal:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:31,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:31,556 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:31,557 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:31,557 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:31,557 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: 
/tmp/temp-323341057/tmp-1104693095:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:31,557 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:31,557 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:31,557 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: 
-1
2008-03-29 12:07:32,402 [Thread-0] INFO  org.apache.hadoop.mapred.MapTask - 
numReduceTasks: 1
2008-03-29 12:07:32,514 [Thread-0] INFO  
org.apache.hadoop.mapred.LocalJobRunner -
2008-03-29 12:07:32,514 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Task 'map_0000' done.
2008-03-29 12:07:32,521 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Saved output of task 'map_0000' to file:/tmp/temp-323341057/tmp-1104693095
2008-03-29 12:07:32,568 [Thread-0] INFO  
org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2008-03-29 12:07:32,568 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Task 'reduce_4q573x' done.
2008-03-29 12:07:32,572 [Thread-0] INFO  org.apache.hadoop.mapred.TaskRunner - 
Saved output of task 'reduce_4q573x' to file:/tmp/temp-323341057/tmp-1104693095
2008-03-29 12:07:33,369 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - 
Pig progress = 100%
(A)
(B)
(C)
==========================================
Local execution mode
==========================================
test.all.pig
(A)
(B)
(C)
(?0?Gs?r?r?s?}8)
(BZh91AY&SY????8 !?h3M???"?(HP??)
test.bz2.pig
(BZh91AY&SY????8 !?h3M???"?(HP??)
test.gz.pig
(?0?Gs?r?r?s?}8)
test.normal.pig
(A)
(B)
(C)

{noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
