I have redacted the exact path, since I don't want to publish it on a newsgroup. But here's how I actually made the s3n URI:
Go to S3Fox, look up the file I want to test, and extract the full HTTP URL. This is something like:

http://log.s3.amazonaws.com/file/path/2010.04.13.20:05:04.log.bz2

In this example, 'log' is the name of my bucket. Then I replace the http:// with s3n:// and remove the '.s3.amazonaws.com' from the string. That results in:

s3n://log/file/path/2010.04.13.20:05:04.log.bz2

Then I add in the key and secret key:

s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2

Let me know if there's some other way to form the s3n URI.

Dave Viner

On Mon, Jun 14, 2010 at 8:39 AM, Dan Di Spaltro <[email protected]> wrote:

> Aren't you missing the bucket name?
>
> On Mon, Jun 14, 2010 at 7:00 AM, Dave Viner <[email protected]> wrote:
> > Here's the stack trace related to that error:
> >
> > Pig Stack Trace
> > ---------------
> > ERROR 2997: Unable to recreate exception from backend error:
> > org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to
> > create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
> >
> > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to
> > open iterator for alias LOGS
> >     at org.apache.pig.PigServer.openIterator(PigServer.java:521)
> >     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
> >     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
> >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> >     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
> >     at org.apache.pig.Main.main(Main.java:357)
> > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997:
> > Unable to recreate exception from backend error:
> > org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to
> > create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:268)
> >     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
> >     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
> >     at org.apache.pig.PigServer.store(PigServer.java:569)
> >     at org.apache.pig.PigServer.openIterator(PigServer.java:504)
> >     ... 6 more
> >
> > After much more experimentation, I discovered that if I copy the file
> > locally before executing Pig, the script works properly. That is, I ran:
> >
> > % /usr/local/hadoop/bin/hadoop dfs -copyToLocal "s3n:///log/file/path/2010-04-13-20-05-04.log.bz2" test.bz2
> >
> > Then in Pig, I read in the file using:
> >
> > logstest2 = load 'test.bz2' USING PigStorage('\t');
> >
> > and it worked fine.
> >
> > One additional problem I discovered, at least for HDFS, is that dfs
> > -copyToLocal does not work for a file with a ':' in the name. When I
> > replaced the ':' with '-', it worked fine. However, even using the '-'
> > filename, Pig would not open the remote file.
> >
> > Dave Viner
> >
> > On Sun, Jun 13, 2010 at 11:09 PM, Ashutosh Chauhan <[email protected]> wrote:
> >
> >> Dave,
> >>
> >> A log file should be sitting in the directory from which you are running
> >> Pig. It will contain the stack trace for the failure. Can you paste the
> >> contents of the log file here?
> >>
> >> Ashutosh
> >>
> >> On Sun, Jun 13, 2010 at 19:36, Dave Viner <[email protected]> wrote:
> >> > I'm having trouble using S3 as a data source for files in the LOAD
> >> > statement. From research, it definitely appears that I want s3n://, not
> >> > s3://, because the file was placed there by another (non-Hadoop/Pig)
> >> > process.
> >> >
> >> > So, here's the basic step:
> >> >
> >> > LOGS = LOAD 's3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2'
> >> >     USING PigStorage('\t');
> >> > dump LOGS;
> >> >
> >> > I get this grunt error:
> >> >
> >> > org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to
> >> > create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
> >> >
> >> > Is there some other way I can/should specify a file from S3 as the source
> >> > of a LOAD statement?
> >> >
> >> > Thanks,
> >> > Dave Viner
>
> --
> Dan Di Spaltro
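[Editor's note] The URL-to-s3n rewrite described at the top of the thread can be sketched as a small helper. This is only a hedged sketch: the function name and sample credentials are hypothetical, and, following Dan's question about the missing bucket name, it places the bucket directly after the '@' rather than inserting a slash before it.

```python
# Hypothetical helper illustrating the rewrite Dave describes:
# http://<bucket>.s3.amazonaws.com/<path>  ->  s3n://<key>:<secret>@<bucket>/<path>
from urllib.parse import urlparse

def make_s3n_uri(http_url, access_key, secret_key):
    parsed = urlparse(http_url)
    # Strip the '.s3.amazonaws.com' suffix to recover the bucket name.
    bucket = parsed.netloc.split(".s3.amazonaws.com")[0]
    # Bucket name goes directly after the '@' (no leading slash).
    return "s3n://%s:%s@%s%s" % (access_key, secret_key, bucket, parsed.path)

print(make_s3n_uri(
    "http://log.s3.amazonaws.com/file/path/2010.04.13.20:05:04.log.bz2",
    "my-key", "my-skey"))
# -> s3n://my-key:my-skey@log/file/path/2010.04.13.20:05:04.log.bz2
```

Note that keys containing '/' or '+' would still need URL-escaping before being embedded in the URI, and the ':' characters in the file name may cause separate problems, as discussed above.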
