On 8/25/06 4:24 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:
> Gian Lorenzo Thione wrote:
>> At Powerset we have used EC2 and Hadoop with a large number of nodes,
>> successfully running Map/Reduce computations and HDFS. Pretty much like you
>> describe, we use HDFS for intermediate results and caching, and periodically
>> extract data to our local network. We are not really using S3 at the moment
>> for persistent storage.
>
> Why don't you use S3 for persistent storage? It would seem more
> economical to keep things there, since transfers to and from S3 are
> free, while transferring offsite is rather costly.
>

We got the system up and running and are experimenting with S3. We'll likely
implement an S3/HDFS native interface. I am assuming that could be of interest
as well :)

>> A nice feature of Hadoop as measured against our use of EC2 has been the
>> capability of fluidly changing the number of instances that are part of the
>> cluster. Our instances are set up to join the cluster and the DFS as soon as
>> they are activated and when - for any reason - we lose those machines, the
>> overall process doesn't suffer. We have been quite happy with this, even at
>> significant number of instances.
>
> Great to hear!
>
> It would be useful to hear more about how you build your images. If
> possible, can you share this on the Hadoop wiki, to provide a reference
> for others?
>

I'll get back to you on this.

>> As a byproduct of running these experiments, we have implemented some
>> patches to Hadoop to report to the master IP's and hostnames that are
>> different than the default ones assigned by InetAddress's static method
>> (getLocalHost). This is due to the fact that machines sometimes are assigned
>> to specific networks to deal with firewalls in various ways so we may want
>> to report Ips to the JobTracker from different interfaces, in order for the
>> tracker to contact them back. In our model the interface is specified as
>> part of the configuration parameters.
>>
>> Is this something that the Hadoop project would be interested to
>> incorporate?
>
> Yes, please. EC2 seems like a facility that we'd like Hadoop to work
> well on. It is great resource for folks who don't have the means to
> build and operate their own clusters, but do sometimes need such
> large-scale infrastructure, for, e.g., experiments, research, testing, etc.
>
> Doug

Ok, great. We'll send a patch and a description soon.

Lorenzo Thione
Founder and Product Architect
Powerset, Inc.

M: (415)812-4667
F: (757)257-9379
[EMAIL PROTECTED]
Skype: lorenzo.thione
AIM: gthione
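
For readers following the interface-selection idea discussed in this thread, here is a minimal sketch of how a node might pick the address it reports from a configured network interface instead of relying on InetAddress.getLocalHost(). This is not the Powerset patch, which is not shown in this message. The class name ReportedAddress, the method resolve, and the idea of passing the interface name in as a plain string (standing in for a configuration parameter) are all assumptions made for illustration. Only standard java.net calls are used.

```java
import java.io.UncheckedIOException;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: chooses the address a worker would report to the master.
final class ReportedAddress {

    // interfaceName stands in for a value read from configuration (assumption).
    // A null or empty name falls back to the default InetAddress.getLocalHost().
    static InetAddress resolve(String interfaceName) {
        try {
            if (interfaceName == null || interfaceName.isEmpty()) {
                return InetAddress.getLocalHost();
            }
            NetworkInterface nif = NetworkInterface.getByName(interfaceName);
            if (nif == null) {
                throw new IllegalArgumentException(
                        "No such network interface: " + interfaceName);
            }
            List<InetAddress> addresses = Collections.list(nif.getInetAddresses());
            if (addresses.isEmpty()) {
                throw new IllegalArgumentException(
                        "Interface has no addresses: " + interfaceName);
            }
            // Prefer an IPv4 address if the interface has one; otherwise
            // return the first address bound to the interface.
            for (InetAddress address : addresses) {
                if (address instanceof Inet4Address) {
                    return address;
                }
            }
            return addresses.get(0);
        } catch (SocketException | UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass an interface name such as "eth0" as the first argument,
        // or no argument to use the default lookup.
        String name = args.length > 0 ? args[0] : null;
        InetAddress address = resolve(name);
        System.out.println(address.getHostName() + " / " + address.getHostAddress());
    }
}
```

Preferring IPv4 here is a design choice of the sketch, not something stated in the thread; a real patch would presumably also need to decide what to do when an interface carries several addresses.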