Reading compressed files in local mode + MiniMRCluster
------------------------------------------------------
Key: PIG-175
URL: https://issues.apache.org/jira/browse/PIG-175
Project: Pig
Issue Type: Bug
Reporter: Craig Macdonald
Attachments: testCompressed.sh
I have written a small test script that tests if three simple compressed and
uncompressed files can be loaded successfully. Essentially, it writes a file,
compresses it using gzip and bzip2, and see if Pig can load it. I use both
local execution mode and miniMR cluster.
Here are my results:
MiniMRCluster
* uncompressed: OK
* gzip: OK
* bzip2: OK
* All three at once: not OK
Local Execution Mode
* uncompressed: OK
* gzip: not OK (garbled output)
* bzip2: not OK ( garbled output)
* All three at once: not OK (expected)
I'm not sure what the problem is with the miniMRcluster - there is a NPE in
PigSplit.getLocations(). I suspect that getFileCacheHints() is returning null,
which ususally indicates a non-existant file.
However, for the local execution mode, I'm fairly confident that this mode has
no support for compressed files.
Craig
{noformat}
==========================================
Bashs good friend: cat
==========================================
Normal
A
B
C
bz2
A
B
C
gzip
A
B
C
==========================================
MiniMRCluster
==========================================
test.all.pig
2008-03-29 12:07:22,103 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
hadoop file system at: file:///
2008-03-29 12:07:22,241 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics -
Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-03-29 12:07:22,555 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job
-----
2008-03-29 12:07:22,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input:
[/users/grad/craigm/src/pig/FROMApache/trunk4/trunk/test.normal:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:22,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:22,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:22,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:22,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:22,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output:
/tmp/temp-1403805719/tmp1733057091:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:22,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:22,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:22,557 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism:
-1
2008-03-29 12:07:23,427 [Thread-0] INFO org.apache.hadoop.mapred.MapTask -
numReduceTasks: 1
2008-03-29 12:07:23,544 [Thread-0] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2008-03-29 12:07:23,545 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Task 'map_0000' done.
2008-03-29 12:07:23,581 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Saved output of task 'map_0000' to file:/tmp/temp-1403805719/tmp1733057091
2008-03-29 12:07:23,625 [Thread-0] INFO
org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2008-03-29 12:07:23,626 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Task 'reduce_cibps7' done.
2008-03-29 12:07:23,630 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Saved output of task 'reduce_cibps7' to file:/tmp/temp-1403805719/tmp1733057091
2008-03-29 12:07:24,383 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher -
Pig progress = 100%
(A)
(B)
(C)
2008-03-29 12:07:24,415 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job
-----
2008-03-29 12:07:24,415 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input:
[/user/craigm/test.gz:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:24,416 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:24,416 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:24,416 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:24,416 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:24,416 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output:
/tmp/temp-1403805719/tmp-1191951534:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:24,416 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:24,416 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:24,417 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism:
-1
java.lang.NullPointerException
at
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigSplit.getLocations(PigSplit.java:107)
at
org.apache.hadoop.mapred.JobClient.writeSplitsFile(JobClient.java:638)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
at
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher.launchPig(MapReduceLauncher.java:260)
at
org.apache.pig.backend.hadoop.executionengine.POMapreduce.open(POMapreduce.java:176)
at
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:274)
at org.apache.pig.PigServer.openIterator(PigServer.java:314)
at
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:255)
at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:160)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:63)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
at org.apache.pig.Main.main(Main.java:265)
2008-03-29 12:07:24,868 [main] ERROR org.apache.pig.tools.grunt.Grunt -
java.io.IOException: Unable to open iterator for alias: gz
at
org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:16)
at org.apache.pig.PigServer.openIterator(PigServer.java:325)
at
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:255)
at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:160)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:63)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
at org.apache.pig.Main.main(Main.java:265)
Caused by: org.apache.pig.backend.executionengine.ExecException:
java.io.IOException
at
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:288)
at org.apache.pig.PigServer.openIterator(PigServer.java:314)
... 5 more
Caused by: java.io.IOException
at
org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:16)
at
org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:12)
at
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher.launchPig(MapReduceLauncher.java:380)
at
org.apache.pig.backend.hadoop.executionengine.POMapreduce.open(POMapreduce.java:176)
at
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:274)
... 6 more
Caused by: java.lang.NullPointerException
at
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigSplit.getLocations(PigSplit.java:107)
at
org.apache.hadoop.mapred.JobClient.writeSplitsFile(JobClient.java:638)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
at
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher.launchPig(MapReduceLauncher.java:260)
... 8 more
2008-03-29 12:07:24,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - Unable
to open iterator for alias: gz
test.bz2.pig
2008-03-29 12:07:25,349 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
hadoop file system at: file:///
2008-03-29 12:07:25,486 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics -
Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-03-29 12:07:25,761 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job
-----
2008-03-29 12:07:25,761 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input:
[/users/grad/craigm/src/pig/FROMApache/trunk4/trunk/test.bz2:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:25,761 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:25,762 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:25,762 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:25,762 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:25,762 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output:
/tmp/temp-142293823/tmp-1682881533:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:25,762 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:25,762 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:25,762 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism:
-1
2008-03-29 12:07:26,585 [Thread-0] INFO org.apache.hadoop.mapred.MapTask -
numReduceTasks: 1
2008-03-29 12:07:26,802 [Thread-0] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2008-03-29 12:07:26,802 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Task 'map_0000' done.
2008-03-29 12:07:26,809 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Saved output of task 'map_0000' to file:/tmp/temp-142293823/tmp-1682881533
2008-03-29 12:07:26,852 [Thread-0] INFO
org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2008-03-29 12:07:26,852 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Task 'reduce_r75h48' done.
2008-03-29 12:07:26,859 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Saved output of task 'reduce_r75h48' to file:/tmp/temp-142293823/tmp-1682881533
2008-03-29 12:07:27,547 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher -
Pig progress = 100%
(A)
(B)
(C)
test.gz.pig
2008-03-29 12:07:28,110 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
hadoop file system at: file:///
2008-03-29 12:07:28,266 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics -
Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-03-29 12:07:28,582 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job
-----
2008-03-29 12:07:28,583 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input:
[/users/grad/craigm/src/pig/FROMApache/trunk4/trunk/test.gz:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:28,583 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:28,583 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:28,583 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:28,584 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:28,584 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output:
/tmp/temp-1552662535/tmp1393315176:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:28,584 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:28,584 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:28,584 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism:
-1
2008-03-29 12:07:29,621 [Thread-0] INFO org.apache.hadoop.mapred.MapTask -
numReduceTasks: 1
2008-03-29 12:07:29,677 [Thread-0] WARN
org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
2008-03-29 12:07:29,830 [Thread-0] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2008-03-29 12:07:29,831 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Task 'map_0000' done.
2008-03-29 12:07:29,875 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Saved output of task 'map_0000' to file:/tmp/temp-1552662535/tmp1393315176
2008-03-29 12:07:30,096 [Thread-0] INFO
org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2008-03-29 12:07:30,097 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Task 'reduce_kan4fo' done.
2008-03-29 12:07:30,103 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Saved output of task 'reduce_kan4fo' to file:/tmp/temp-1552662535/tmp1393315176
2008-03-29 12:07:30,583 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher -
Pig progress = 100%
(A)
(B)
(C)
test.normal.pig
2008-03-29 12:07:31,114 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
hadoop file system at: file:///
2008-03-29 12:07:31,270 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics -
Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-03-29 12:07:31,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job
-----
2008-03-29 12:07:31,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input:
[/users/grad/craigm/src/pig/FROMApache/trunk4/trunk/test.normal:org.apache.pig.builtin.PigStorage()]
2008-03-29 12:07:31,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
2008-03-29 12:07:31,556 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
2008-03-29 12:07:31,557 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
2008-03-29 12:07:31,557 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
2008-03-29 12:07:31,557 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output:
/tmp/temp-323341057/tmp-1104693095:org.apache.pig.builtin.BinStorage
2008-03-29 12:07:31,557 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
2008-03-29 12:07:31,557 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
2008-03-29 12:07:31,557 [main] INFO
org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism:
-1
2008-03-29 12:07:32,402 [Thread-0] INFO org.apache.hadoop.mapred.MapTask -
numReduceTasks: 1
2008-03-29 12:07:32,514 [Thread-0] INFO
org.apache.hadoop.mapred.LocalJobRunner -
2008-03-29 12:07:32,514 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Task 'map_0000' done.
2008-03-29 12:07:32,521 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Saved output of task 'map_0000' to file:/tmp/temp-323341057/tmp-1104693095
2008-03-29 12:07:32,568 [Thread-0] INFO
org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2008-03-29 12:07:32,568 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Task 'reduce_4q573x' done.
2008-03-29 12:07:32,572 [Thread-0] INFO org.apache.hadoop.mapred.TaskRunner -
Saved output of task 'reduce_4q573x' to file:/tmp/temp-323341057/tmp-1104693095
2008-03-29 12:07:33,369 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher -
Pig progress = 100%
(A)
(B)
(C)
==========================================
Local execution mode
==========================================
test.all.pig
(A)
(B)
(C)
(?0?Gs?r?r?s?}8)
(BZh91AY&SY????8 !?h3M???"?(HP??)
test.bz2.pig
(BZh91AY&SY????8 !?h3M???"?(HP??)
test.gz.pig
(?0?Gs?r?r?s?}8)
test.normal.pig
(A)
(B)
(C)
{noformat}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.