I have redacted the exact path, since I don't want to publish it on a newsgroup. But here's how I actually made the s3n URI:
Go to S3Fox, look up the file I want to test, and extract the full HTTP URL. This is something like:

http://log.s3.amazonaws.com/file/path/2010.04.13.20:05:04.log.bz2

In this example, 'log' is the name of my bucket. Then I replace the http:// with s3n:// and remove the '.s3.amazonaws.com' from the string. That results in:

s3n://log/file/path/2010.04.13.20:05:04.log.bz2

Then I add in the key and secret key:

s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2

Let me know if there's some other way to form the s3n URI.

Dave Viner

On Mon, Jun 14, 2010 at 8:39 AM, Dan Di Spaltro <[email protected]> wrote:

> Aren't you missing the bucket name?
>
> On Mon, Jun 14, 2010 at 7:00 AM, Dave Viner <[email protected]> wrote:
> > Here's the stack trace related to that error:
> >
> > Pig Stack Trace
> > ---------------
> > ERROR 2997: Unable to recreate exception from backend error:
> > org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to
> > create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
> >
> > org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to
> > open iterator for alias LOGS
> >     at org.apache.pig.PigServer.openIterator(PigServer.java:521)
> >     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
> >     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
> >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
> >     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
> >     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
> >     at org.apache.pig.Main.main(Main.java:357)
> > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997:
> > Unable to recreate exception from backend error:
> > org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to
> > create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:268)
> >     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
> >     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
> >     at org.apache.pig.PigServer.store(PigServer.java:569)
> >     at org.apache.pig.PigServer.openIterator(PigServer.java:504)
> >     ... 6 more
> >
> > After much more experimentation, I discovered that if I copy the file
> > locally before executing Pig, the script works properly. That is, I ran:
> >
> > % /usr/local/hadoop/bin/hadoop dfs -copyToLocal "s3n:///log/file/path/2010-04-13-20-05-04.log.bz2" test.bz2
> >
> > Then in Pig, I read in the file using:
> >
> > logstest2 = load 'test.bz2' USING PigStorage('\t');
> >
> > and it worked fine.
> >
> > One additional problem I discovered, at least for HDFS, is that dfs
> > -copyToLocal does not work for a file with a ':' in the name. When I
> > replaced the ':' with '-', it worked fine. However, even using the '-'
> > filename, Pig would not open the remote file.
> >
> > Dave Viner
> >
> > On Sun, Jun 13, 2010 at 11:09 PM, Ashutosh Chauhan <[email protected]> wrote:
> >
> >> Dave,
> >>
> >> A log file should be sitting in the directory from which you are running
> >> Pig. It will contain the stack trace for the failure. Can you paste the
> >> contents of the log file here?
> >>
> >> Ashutosh
> >>
> >> On Sun, Jun 13, 2010 at 19:36, Dave Viner <[email protected]> wrote:
> >> > I'm having trouble using S3 as a data source for files in the LOAD
> >> > statement. From research, it definitely appears that I want s3n://, not
> >> > s3://, because the file was placed there by another (non-Hadoop/Pig)
> >> > process.
> >> >
> >> > So, here's the basic step:
> >> >
> >> > LOGS = LOAD 's3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2'
> >> >     USING PigStorage('\t');
> >> > dump LOGS;
> >> >
> >> > I get this grunt error:
> >> >
> >> > org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to
> >> > create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
> >> >
> >> > Is there some other way I can/should specify a file from S3 as the source
> >> > of a LOAD statement?
> >> >
> >> > Thanks,
> >> > Dave Viner
>
> --
> Dan Di Spaltro
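[Editor's note] The URL-to-s3n rewrite described at the top of the thread can be sketched as a small helper. This is only a hedged sketch: the function name and sample credentials are hypothetical, and, following Dan's question about the missing bucket name, it places the bucket directly after the '@' rather than inserting a slash before it.

```python
# Hypothetical helper illustrating the rewrite Dave describes:
# http://<bucket>.s3.amazonaws.com/<path>  ->  s3n://<key>:<secret>@<bucket>/<path>
from urllib.parse import urlparse

def make_s3n_uri(http_url, access_key, secret_key):
    parsed = urlparse(http_url)
    # Strip the '.s3.amazonaws.com' suffix to recover the bucket name.
    bucket = parsed.netloc.split(".s3.amazonaws.com")[0]
    # Bucket name goes directly after the '@' (no leading slash).
    return "s3n://%s:%s@%s%s" % (access_key, secret_key, bucket, parsed.path)

print(make_s3n_uri(
    "http://log.s3.amazonaws.com/file/path/2010.04.13.20:05:04.log.bz2",
    "my-key", "my-skey"))
# -> s3n://my-key:my-skey@log/file/path/2010.04.13.20:05:04.log.bz2
```

Note that keys containing '/' or '+' would still need URL-escaping before being embedded in the URI, and the ':' characters in the file name may cause separate problems, as discussed above.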
