Cross site HDFS access using the default.fs.name not possible in Pig
--------------------------------------------------------------------
Key: PIG-940
URL: https://issues.apache.org/jira/browse/PIG-940
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Environment: Hadoop 20
Reporter: Viraj Bhat
Fix For: 0.3.0
I have a script which does the following.. access data from a remote HDFS
location (via a HDFS installed at:hdfs://remotemachine1.company.com/ ) [[as I
do not want to copy this huge amount of data between HDFS locations]].
However I want my Pigscript to write data to the HDFS running on
localmachine.company.com.
Currently Pig does not support that behavior and complains that:
"hdfs://localmachine.company.com/user/viraj/A1.txt does not exist"
{code}
A = LOAD 'hdfs://remotemachine1.company.com/user/viraj/A1.txt' as (a, b);
B = LOAD 'hdfs://remotemachine1.company.com/user/viraj/B1.txt' as (c, d);
C = JOIN A by a, B by c;
store C into 'output' using PigStorage();
{code}
=======================================================================================================================================
2009-09-01 00:37:24,032 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
hadoop file system at: hdfs://localmachine.company.com:8020
2009-09-01 00:37:24,277 [main] INFO
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
map-reduce job tracker at: localmachine.company.com:50300
2009-09-01 00:37:24,567 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer
- Rewrite: POPackage->POForEach to POJoinPackage
2009-09-01 00:37:24,573 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 1
2009-09-01 00:37:24,573 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 1
2009-09-01 00:37:26,197 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
2009-09-01 00:37:26,249 [Thread-9] WARN org.apache.hadoop.mapred.JobClient -
Use GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.
2009-09-01 00:37:26,746 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 0% complete
2009-09-01 00:37:26,746 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete
2009-09-01 00:37:26,747 [main] ERROR
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1 map reduce job(s) failed!
2009-09-01 00:37:26,756 [main] ERROR
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Failed to produce result in:
"hdfs:/localmachine.company.com/tmp/temp-1470407685/tmp-510854480"
2009-09-01 00:37:26,756 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Failed!
2009-09-01 00:37:26,758 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
Details at logfile: /home/viraj/pigscripts/pig_1251765443851.log
=======================================================================================================================================
The error file in Pig contains:
=======================================================================================================================================
ERROR 2998: Unhandled internal error.
org.apache.pig.backend.executionengine.ExecException: ERROR 2100:
hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
at
org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126)
at
org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
at
org.apache.pig.impl.io.ValidatingInputFileSpec.<init>(ValidatingInputFileSpec.java:44)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at
org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at
org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException:
ERROR 2100: hdfs://localmachine.company.com/user/viraj/A1.txt does not exist.
at
org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:126)
at
org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
at
org.apache.pig.impl.io.ValidatingInputFileSpec.<init>(ValidatingInputFileSpec.java:44)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:228)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at
org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at
org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
=======================================================================================================================================
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.