Here's the stack trace related to that error:
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias LOGS
        at org.apache.pig.PigServer.openIterator(PigServer.java:521)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:268)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
        at org.apache.pig.PigServer.store(PigServer.java:569)
        at org.apache.pig.PigServer.openIterator(PigServer.java:504)
        ... 6 more
After much more experimentation, I discovered that if I copy the file
locally before executing Pig, the script works properly. That is, I ran:
% /usr/local/hadoop/bin/hadoop dfs -copyToLocal "s3n:///log/file/path/2010-04-13-20-05-04.log.bz2" test.bz2
Then, in Pig, I read in the file using:
logstest2 = LOAD 'test.bz2' USING PigStorage('\t');
and it worked fine.
One additional problem I discovered, at least for HDFS, is that dfs
-copyToLocal does not work for a file with a ':' in its name. When I
replaced the ':' with '-', it worked fine.
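For anyone else who hits this, the rename itself seems to need to happen
outside Hadoop, since Hadoop chokes on the ':' path either way. An untested
sketch using s3cmd (assuming it is installed and configured; "my-bucket" is
a placeholder, not my real bucket):

% s3cmd mv "s3://my-bucket/log/file/path/2010.04.13.20:05:04.log.bz2" \
    "s3://my-bucket/log/file/path/2010-04-13-20-05-04.log.bz2"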
However, even using the '-' filename, Pig would not open the remote file.
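For the archives: one alternative I have seen suggested, though I have not
verified it myself, is to keep the access key and secret out of the URI
entirely and put them in the client's conf/core-site.xml instead:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>my-key</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>my-skey</value>
</property>

Then the LOAD needs only the bucket and path ("my-bucket" below is a
placeholder):

LOGS = LOAD 's3n://my-bucket/log/file/path/2010-04-13-20-05-04.log.bz2' USING PigStorage('\t');

Even then, I suspect a filename containing ':' would still fail, since
Hadoop appears to treat ':' in a path as a URI scheme separator.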
Dave Viner
On Sun, Jun 13, 2010 at 11:09 PM, Ashutosh Chauhan <[email protected]> wrote:
> Dave,
>
> A log file should be sitting in the dir from which you are running Pig.
> It will contain the stack trace for the failure. Can you paste the
> contents of the log file here?
>
> Ashutosh
> On Sun, Jun 13, 2010 at 19:36, Dave Viner <[email protected]> wrote:
> > I'm having trouble using S3 as a data source for files in the LOAD
> > statement. From research, it definitely appears that I want s3n://, not
> > s3://, because the file was placed there by another (non-Hadoop/Pig)
> > process.
> > So, here's the basic step:
> >
> > LOGS = LOAD 's3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2'
> > USING PigStorage('\t');
> > DUMP LOGS;
> >
> > I get this error in grunt:
> >
> > org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to
> > create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
> >
> >
> > Is there some other way I can/should specify a file from S3 as the source
> > of a LOAD statement?
> >
> > Thanks
> > Dave Viner
> >
>