I think everyone has similar thoughts, but the presentation provides some
real data and experience.
BTW, for those interested, I have a new poll on ClusterMonkey asking
about clouds and HPC. (http://www.clustermonkey.net/)
The last poll was on GP-GPU use.

--
Doug

> Doug,
>
> Thanks for posting that video. It confirmed what I always suspected
> about clouds for HPC.
>
> Prentice
>
> On 10/03/2011 08:25 AM, Douglas Eadline wrote:
>> Interesting and pragmatic HPC cloud presentation, worth watching
>> (25 minutes)
>>
>> http://insidehpc.com/2011/09/30/video-the-real-future-of-cloud-computing/
>>
>> --
>> Doug
>>
>>> http://arstechnica.com/business/news/2011/09/30000-core-cluster-built-on-amazon-ec2-cloud.ars
>>>
>>> $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
>>>
>>> By Jon Brodkin | Published September 20, 2011 10:49 AM
>>>
>>> Amazon EC2 and other cloud services are expanding the market for
>>> high-performance computing. Without access to a national lab or a
>>> supercomputer in your own data center, cloud computing lets
>>> businesses spin up temporary clusters at will and stop paying for
>>> them as soon as the computing needs are met.
>>>
>>> A vendor called Cycle Computing is on a mission to demonstrate the
>>> potential of Amazon's cloud by building increasingly large clusters
>>> on the Elastic Compute Cloud. Even with Amazon, building a cluster
>>> takes some work, but Cycle combines several technologies to ease
>>> the process and recently used them to create a 30,000-core cluster
>>> running CentOS Linux.
>>>
>>> The cluster, announced publicly this week, was created for an
>>> unnamed "Top 5 Pharma" customer, and ran for about seven hours at
>>> the end of July at a peak cost of $1,279 per hour, including the
>>> fees to Amazon and Cycle Computing. The details are impressive:
>>> 3,809 compute instances, each with eight cores and 7GB of RAM, for
>>> a total of 30,472 cores, 26.7TB of RAM and 2PB (petabytes) of disk
>>> space. Security was ensured with HTTPS, SSH and 256-bit AES
>>> encryption, and the cluster ran across data centers in three Amazon
>>> regions in the United States and Europe. The cluster was dubbed
>>> "Nekomata."
>>>
>>> Spreading the cluster across multiple continents was done partly
>>> for disaster recovery purposes, and also to guarantee that 30,000
>>> cores could be provisioned. "We thought it would improve our
>>> probability of success if we spread it out," Cycle Computing's Dave
>>> Powers, manager of product engineering, told Ars. "Nobody really
>>> knows how many instances you can get at any one time from any one
>>> [Amazon] region."
>>>
>>> Amazon offers its own special cluster compute instances, at a
>>> higher cost than regular-sized virtual machines. These cluster
>>> instances provide 10 Gigabit Ethernet networking along with greater
>>> CPU and memory, but they weren't necessary to build the Cycle
>>> Computing cluster.
>>>
>>> The pharmaceutical company's job, related to molecular modeling,
>>> was "embarrassingly parallel," so a fast interconnect wasn't
>>> crucial. To further reduce costs, Cycle took advantage of Amazon's
>>> low-price "spot instances." To manage the cluster, Cycle Computing
>>> used its own management software as well as the Condor
>>> High-Throughput Computing software and Chef, an open source systems
>>> integration framework.
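An aside on the "embarrassingly parallel" bit above: each modeling task
is completely independent, so the nodes never need to talk to each other
and a slow interconnect costs you nothing; throughput just scales with
core count. Here is a toy sketch of that pattern in Python. The score()
function and the made-up ligand names are placeholders for illustration,
not the actual pharma workload:

  # Embarrassingly parallel: every task is independent, so we just fan
  # them out across cores; no communication happens between tasks.
  from concurrent.futures import ProcessPoolExecutor

  def score(ligand):
      # stand-in for an independent, CPU-bound molecular-modeling task
      return ligand, sum(ord(c) for c in ligand) % 100

  ligands = ["mol-%05d" % i for i in range(1000)]

  if __name__ == "__main__":
      with ProcessPoolExecutor() as pool:   # one worker per core by default
          results = dict(pool.map(score, ligands))
      print(len(results), "ligands scored")

Condor does the same thing across machines instead of cores: you queue a
few thousand independent jobs and let the scheduler farm them out to
whatever slots happen to be free.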
>>> Cycle demonstrated the power of the Amazon cloud earlier this year
>>> with a 10,000-core cluster built for a smaller pharma firm called
>>> Genentech. Now, 10,000 cores is a relatively easy task, says Powers.
>>> "We think we've mastered the small-scale environments," he said.
>>> 30,000 cores isn't the end game, either. Going forward, Cycle plans
>>> bigger, more complicated clusters, perhaps ones that will require
>>> Amazon's special cluster compute instances.
>>>
>>> The 30,000-core cluster may or may not be the biggest one run on
>>> EC2. Amazon isn't saying.
>>>
>>> "I can't share specific customer details, but can tell you that we
>>> do have businesses of all sizes running large-scale,
>>> high-performance computing workloads on AWS [Amazon Web Services],
>>> including distributed clusters like the Cycle Computing 30,000 core
>>> cluster to tightly-coupled clusters often used for science and
>>> engineering applications such as computational fluid dynamics and
>>> molecular dynamics simulation," an Amazon spokesperson told Ars.
>>>
>>> Amazon itself actually built a supercomputer on its own cloud that
>>> made it onto the list of the world's Top 500 supercomputers. With
>>> 7,000 cores, the Amazon cluster ranked number 232 in the world last
>>> November with speeds of 41.82 teraflops, falling to number 451 in
>>> June of this year. So far, Cycle Computing hasn't run the Linpack
>>> benchmark to determine the speed of its clusters relative to Top
>>> 500 sites.
>>>
>>> But Cycle's work is impressive no matter how you measure it. The
>>> job performed for the unnamed pharma company "would take well over
>>> a week for them to run internally," Powers says. In the end, the
>>> cluster performed the equivalent of 10.9 "compute years of work."
>>>
>>> The task of managing such large cloud-based clusters forced Cycle
>>> to step up its own game, with a new plug-in for Chef the company
>>> calls Grill.
>>>
>>> "There is no way that any mere human could keep track of all of the
>>> moving parts on a cluster of this scale," Cycle wrote in a blog
>>> post. "At Cycle, we've always been fans of extreme IT automation,
>>> but we needed to take this to the next level in order to monitor
>>> and manage every instance, volume, daemon, job, and so on in order
>>> for Nekomata to be an efficient 30,000 core tool instead of a big
>>> shiny on-demand paperweight."
>>>
>>> But problems did arise during the 30,000-core run.
>>>
>>> "You can be sure that when you run at massive scale, you are bound
>>> to run into some unexpected gotchas," Cycle notes. "In our case,
>>> one of the gotchas included such things as running out of file
>>> descriptors on the license server. In hindsight, we should have
>>> anticipated this would be an issue, but we didn't find that in our
>>> prelaunch testing, because we didn't test at full scale. We were
>>> able to quickly recover from this bump and keep moving along with
>>> the workload with minimal impact. The license server was able to
>>> keep up very nicely with this workload once we increased the number
>>> of file descriptors."
>>>
>>> Cycle also hit a speed bump related to volume and byte limits on
>>> Amazon's Elastic Block Store volumes. But the company is already
>>> planning bigger and better things.
>>>
>>> "We already have our next use-case identified and will be turning
>>> up the scale a bit more with the next run," the company says. But
>>> ultimately, "it's not about core counts or terabytes of RAM or
>>> petabytes of data. Rather, it's about how we are helping to
>>> transform how science is done."
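The file-descriptor gotcha above is a classic at this scale: thousands
of clients each holding a license checkout (an open socket) will blow
past the default per-process limit of 1024 on most Linux boxes. The
article doesn't say exactly how Cycle fixed it, but the usual
check-and-raise looks something like this sketch using Python's stdlib
resource module (the 65536 target is just an illustrative number):

  # Inspect the per-process open-file limit and raise the soft limit
  # toward the hard cap; exhausting this limit is what "running out of
  # file descriptors" means on a busy license or head server.
  import resource

  soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
  print("current limits: soft=%d hard=%d" % (soft, hard))

  target = 65536                 # illustrative; size it above the expected client count
  new_soft = min(target, hard)   # the soft limit may never exceed the hard limit
  resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
  print("soft limit raised to", new_soft)

Raising the hard limit itself, or making the change stick across
restarts, normally means editing /etc/security/limits.conf or the
daemon's init script and bouncing the service.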
--
Doug

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf