On 8/25/06 4:24 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Gian Lorenzo Thione wrote:
>> At Powerset we have used EC2 and Hadoop with a large number of nodes,
>> successfully running Map/Reduce computations and HDFS. Pretty much like you
>> describe, we use HDFS for intermediate results and caching, and periodically
>> extract data to our local network. We are not really using S3 at the moment
>> for persistent storage.
> 
> Why don't you use S3 for persistent storage?  It would seem more
> economical to keep things there, since transfers to and from S3 are
> free, while transferring offsite is rather costly.
> 

We got the system up and running and are experimenting with S3. We'll likely
implement an S3/HDFS native interface. I am assuming that could be of
interest as well :)

>> A nice feature of Hadoop as measured against our use of EC2 has been the
>> capability of fluidly changing the number of instances that are part of the
>> cluster. Our instances are set up to join the cluster and the DFS as soon as
>> they are activated and when - for any reason - we lose those machines, the
>> overall process doesn't suffer. We have been quite happy with this, even at
>> a significant number of instances.
> 
> Great to hear!
> 
> It would be useful to hear more about how you build your images.  If
> possible, can you share this on the Hadoop wiki, to provide a reference
> for others?
> 

I'll get back to you on this.

>> As a byproduct of running these experiments, we have implemented some
>> patches to Hadoop to report to the master IPs and hostnames that
>> differ from the default ones returned by InetAddress's static method
>> (getLocalHost). This is because machines are sometimes assigned
>> to specific networks to deal with firewalls in various ways, so we may want
>> to report IPs to the JobTracker from different interfaces, in order for the
>> tracker to contact them back. In our model the interface is specified as
>> part of the configuration parameters.
>> 
>> Is this something that the Hadoop project would be interested in
>> incorporating?
> 
> Yes, please.  EC2 seems like a facility that we'd like Hadoop to work
> well on.  It is a great resource for folks who don't have the means to
> build and operate their own clusters, but do sometimes need such
> large-scale infrastructure, for, e.g., experiments, research, testing, etc.
> 
> Doug

Ok, great. We'll send a patch and a description soon.
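
To make the interface-selection idea above a bit more concrete, here is a minimal sketch of how a node might choose the address it reports. This is not the actual patch: the class name ReportedAddress, the methods pickAddress and resolve, and the idea of passing the interface name (e.g. "eth1") as a plain string are all assumptions for illustration. It uses only the standard java.net API and falls back to InetAddress.getLocalHost() when no interface is configured or the named interface has no usable address.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop: picks the address a node reports
// to the master based on a configured network interface name.
public class ReportedAddress {

    // Return the first non-loopback IPv4 address among the candidates,
    // or null if there is none.
    public static InetAddress pickAddress(List<InetAddress> candidates) {
        for (InetAddress addr : candidates) {
            if (addr instanceof Inet4Address && !addr.isLoopbackAddress()) {
                return addr;
            }
        }
        return null;
    }

    // Resolve the address to report. interfaceName stands in for a
    // configuration parameter (assumed here); null or empty means
    // "use the default from InetAddress.getLocalHost()".
    public static InetAddress resolve(String interfaceName)
            throws SocketException, UnknownHostException {
        if (interfaceName != null && !interfaceName.isEmpty()) {
            NetworkInterface nif = NetworkInterface.getByName(interfaceName);
            if (nif != null) {
                InetAddress picked =
                        pickAddress(Collections.list(nif.getInetAddresses()));
                if (picked != null) {
                    return picked;
                }
            }
        }
        // Default behaviour: whatever the JVM considers the local host.
        return InetAddress.getLocalHost();
    }

    public static void main(String[] args) throws Exception {
        // Optional first argument: the interface name to report from.
        String name = args.length > 0 ? args[0] : null;
        InetAddress addr = resolve(name);
        System.out.println(addr.getHostAddress() + " " + addr.getHostName());
    }
}
```

Running it with an interface name as the first argument prints the address and hostname that would be reported for that interface; running it with no argument prints the getLocalHost default.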


Lorenzo Thione
Founder and Product Architect
Powerset, Inc.
M: (415)812-4667
F: (757)257-9379
[EMAIL PROTECTED]
Skype: lorenzo.thione
AIM: gthione