Perhaps we should revisit the implementation of NativeS3FileSystem so
that it doesn't always buffer the file on the client. We could have an
option to make it write directly to S3. Thoughts?
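
To make the idea concrete, here is a rough sketch (not actual Hadoop code; the Uploader interface is hypothetical) of an output stream that pushes bytes straight to S3 through a pipe drained by a background upload thread, instead of spooling everything to a local temp file and uploading it in close():

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class DirectS3OutputStream extends OutputStream {

  // Hypothetical uploader; the real code would go through the S3 client
  // that NativeS3FileSystem already uses.
  public interface Uploader {
    void put(String bucket, String key, InputStream data) throws IOException;
  }

  private final PipedOutputStream pipeOut = new PipedOutputStream();
  private final Thread uploadThread;
  private volatile IOException uploadError;

  public DirectS3OutputStream(final Uploader uploader, final String bucket,
      final String key) throws IOException {
    final PipedInputStream pipeIn = new PipedInputStream(pipeOut);
    uploadThread = new Thread(new Runnable() {
      public void run() {
        try {
          uploader.put(bucket, key, pipeIn);  // consumes bytes as they arrive
        } catch (IOException e) {
          uploadError = e;
        }
      }
    });
    uploadThread.start();
  }

  public void write(int b) throws IOException {
    pipeOut.write(b);  // no local temp file
  }

  public void close() throws IOException {
    pipeOut.close();   // signals end of stream to the uploader
    try {
      uploadThread.join();
    } catch (InterruptedException e) {
      throw new IOException("Interrupted while waiting for S3 upload");
    }
    if (uploadError != null) {
      throw uploadError;
    }
  }
}

One wrinkle is that a plain S3 PUT wants the content length up front, which is presumably why the current code buffers to disk first.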

Regarding the problem with HADOOP-3733, you can work around it by
setting fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey in your
hadoop-site.xml.
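
For example, something along these lines in hadoop-site.xml (the values are placeholders for your own keys):

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>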

Those are set, AFAICT. Or at least I see them set as properties in my $HADOOP_HOME/conf/hadoop-site.xml file.

What would be the right syntax for the distcp command in this case? I'm trying:

hadoop distcp hdfs://<hostname>:50001/<path> s3://<bucket>/<path>

and I get:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
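
Would passing the keys on the command line as generic options be expected to work here? I'm assuming distcp honors -D, so something like (keys elided):

hadoop distcp -Dfs.s3.awsAccessKeyId=<access key> \
    -Dfs.s3.awsSecretAccessKey=<secret key> \
    hdfs://<hostname>:50001/<path> s3://<bucket>/<path>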

Thanks,

-- Ken


On Fri, May 8, 2009 at 1:17 AM, Andrew Hitchcock <adpow...@gmail.com> wrote:
 Hi Ken,

 S3N doesn't work that well with large files. When uploading a file to
 S3, S3N saves it to local disk during write() and then uploads to S3
 during the close(). Close can take a long time for large files and it
 doesn't report progress, so the call can time out.

 As a workaround, I'd recommend either increasing the timeout or
 uploading the files by hand. Since you only have a few large files,
 you might want to copy the files to local disk and then use something
 like s3cmd to upload them to S3.
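
 For example, something like this for each file (paths and bucket are
 placeholders):

 hadoop fs -copyToLocal /<path>/<dir>/part-00000 /tmp/part-00000
 s3cmd put /tmp/part-00000 s3://<bucket>/<dir>/part-00000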

 Regards,
 Andrew

On Thu, May 7, 2009 at 4:42 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
 Hi all,

I have a few large files (4 that are 1.8GB+) I'm trying to copy from HDFS to
 S3. My micro EC2 cluster is running Hadoop 0.19.1, and has one master/two
 slaves.

 I first tried using the hadoop fs -cp command, as in:

 hadoop fs -cp output/<dir>/ s3n://<bucket>/<dir>/

 This seemed to be working, as I could watch the network traffic spike, and
 temp files were being created in S3 (as seen with CyberDuck).

But then it seemed to hang. Nothing happened for 30 minutes, so I killed the
 command.

 Then I tried using the hadoop distcp command, as in:

 hadoop distcp hdfs://<host>:50001/<path>/<dir>/ s3://<public key>:<private
 key>@<bucket>/<dir2>/

 This failed because my secret key has a '/' in it
 (http://issues.apache.org/jira/browse/HADOOP-3733)

[snip]
--
Ken Krugler
+1 530-210-6378
