I would strongly suggest using this method to read from S3 only. I have had problems writing large volumes of data to S3 from Hadoop using the native s3fs. Supposedly a fix is on the way from Amazon (the failure is an undocumented internal error being thrown), but that fix is already two months later than we expected, and we currently have no ETA.
If you want to write data to S3 reliably, use the S3 API directly and stream the data from HDFS into S3. Just remember that S3 requires the final size of an object before you start writing, so it is not true streaming in that sense. After your job has finished writing its part files (to HDFS), you can run a map-only job that streams the data up into S3 through the S3 API. In no way, shape, or form should S3 currently be considered a replacement for HDFS when it comes to writes: your jobs will need to be modified and customized to write to S3 reliably, there are file size limits on writes, and the multipart upload option does not work correctly and randomly throws an internal Amazon error.
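As a rough illustration of that upload step, here is a minimal sketch of what each task of such a map-only job might do. It assumes the AWS SDK for Java is on the classpath (my choice of client library, not something prescribed above), and the class name, bucket, key, and credentials are all placeholders. The point is that the HDFS file length is known up front, so it can be handed to S3 as the content length:

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    // Hypothetical helper: copies one finished part file from HDFS to S3.
    // In a map-only job, this logic would run once per part file, in the mapper.
    public class HdfsToS3Copy {
        public static void main(String[] args) throws Exception {
            Path src = new Path(args[0]);   // an HDFS part file from a finished job
            String bucket = args[1];        // placeholder target bucket
            String key = args[2];           // placeholder object key

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(src.toUri(), conf);
            FileStatus stat = fs.getFileStatus(src);

            // S3 wants the final object size before the write starts, so pass
            // the HDFS file length in the metadata instead of streaming blind.
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(stat.getLen());

            AmazonS3 s3 = new AmazonS3Client(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));  // placeholders

            InputStream in = fs.open(src);
            try {
                // Single-shot upload; avoids the multipart path entirely.
                s3.putObject(bucket, key, in, meta);
            } finally {
                in.close();
            }
        }
    }

Because each part file's length is available from HDFS metadata, every map task can push its file in one putObject call rather than relying on the broken multipart option.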
You have been warned!

-Matt

On 1/18/12 9:37 AM, "Mark Kerzner" <[email protected]> wrote:

> It worked, thank you, Harsh.
>
> Mark
>
> On Wed, Jan 18, 2012 at 1:16 AM, Harsh J <[email protected]> wrote:
>
>> Ah sorry about missing that. Settings would go in core-site.xml
>> (hdfs-site.xml will no longer be relevant once you switch to using S3).
>>
>> On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote:
>>
>> > That wiki page mentions hadoop-site.xml, but this is old; now you have
>> > core-site.xml and hdfs-site.xml, so which one do you put it in?
>> >
>> > Thank you (and good night Central Time :)
>> >
>> > mark
>> >
>> > On Wed, Jan 18, 2012 at 12:52 AM, Harsh J <[email protected]> wrote:
>> >
>> >> When using S3 you do not need to run any component of HDFS at all. It
>> >> is meant to be an alternate FS choice. You need to run only MR.
>> >>
>> >> The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions how
>> >> to go about specifying your auth details to S3, either directly via
>> >> the fs.default.name URI or via the additional properties
>> >> fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work
>> >> for you?
>> >>
>> >> On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner <[email protected]> wrote:
>> >>> Well, here is my error message
>> >>>
>> >>> Starting Hadoop namenode daemon: starting namenode, logging to
>> >>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out
>> >>> ERROR. Could not start Hadoop namenode daemon
>> >>> Starting Hadoop secondarynamenode daemon: starting secondarynamenode,
>> >>> logging to
>> >>> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out
>> >>> Exception in thread "main" java.lang.IllegalArgumentException: Invalid URI
>> >>> for NameNode address (check fs.default.name): s3n://myname.testdata is not
>> >>> of scheme 'hdfs'.
>> >>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224)
>> >>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209)
>> >>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182)
>> >>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:150)
>> >>>     at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624)
>> >>> ERROR. Could not start Hadoop secondarynamenode daemon
>> >>>
>> >>> And, if I don't need to start the NameNode, then where do I give the S3
>> >>> credentials?
>> >>>
>> >>> Thank you,
>> >>> Mark
>> >>>
>> >>> On Wed, Jan 18, 2012 at 12:36 AM, Harsh J <[email protected]> wrote:
>> >>>
>> >>>> Hey Mark,
>> >>>>
>> >>>> What is the exact trouble you run into? What do the error messages
>> >>>> indicate?
>> >>>>
>> >>>> This should be definitive enough, I think:
>> >>>> http://wiki.apache.org/hadoop/AmazonS3
>> >>>>
>> >>>> On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner <[email protected]> wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>> whatever I do, I can't make it work, that is, I cannot use
>> >>>>>
>> >>>>> s3://host
>> >>>>>
>> >>>>> or s3n://host
>> >>>>>
>> >>>>> as a replacement for HDFS while running an EC2 cluster. I change the
>> >>>>> settings in core-site.xml and in hdfs-site.xml, and start the Hadoop
>> >>>>> services, and it fails with error messages.
>> >>>>>
>> >>>>> Is there a place where this is clearly described?
>> >>>>>
>> >>>>> Thank you so much.
>> >>>>>
>> >>>>> Mark
>> >>>>
>> >>>> --
>> >>>> Harsh J
>> >>>> Customer Ops. Engineer, Cloudera
>> >>
>> >> --
>> >> Harsh J
>> >> Customer Ops. Engineer, Cloudera
>>
>> --
>> Harsh J
>> Customer Ops. Engineer, Cloudera
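To make Harsh's pointer concrete, the core-site.xml entries would look roughly like this (a sketch, not taken from the thread; the bucket name and keys are placeholders, and the fs.s3n.* property names go with the s3n:// scheme, while the plain s3:// block-store scheme uses fs.s3.* instead):

    <configuration>
      <property>
        <name>fs.default.name</name>
        <!-- placeholder bucket -->
        <value>s3n://your-bucket</value>
      </property>
      <property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>YOUR_ACCESS_KEY</value>
      </property>
      <property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>YOUR_SECRET_KEY</value>
      </property>
    </configuration>

With fs.default.name pointing at an S3 bucket, the HDFS daemons should not be started at all, which is exactly why the SecondaryNameNode in Mark's log above aborts complaining the URI is not of scheme 'hdfs'.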
