Re: [ccp4bb] buying a cluster

James Holton Tue, 04 Dec 2018 12:16:56 -0800

Graeme's suggestion of a standard benchmarking dataset is a good one, but I'm not so sure a 24 GB download size is going to get a lot of hits. In fact, in some countries you have to pay for internet by the GB, and it costs as much as mobile phone data! The large size also makes it hard to separate two important aspects of data processing: the CPU vs the disk. My goal was to isolate the CPU as much as possible, so I made a "standard" image processing benchmark data set that is hyper-compressed: 11 MB download size. This is similar to the size of the XDS package itself. This hyper-compression is possible because it is a simulated data set and that allows me to add noise on the client side. I also expand the data by 10x by repeating the same 360 deg with symbolic links. The footprint on local disk is only 2.3 GB, so it can usually fit into the ramdisk on /dev/shm of most linux systems. The whole hyper-decompression process is done automatically by my XDS and DIALS benchmarking scripts, or you can just use this one script directly:

http://bl831.als.lbl.gov/~jamesh/benchmarks/get_test_data.com

It will automatically download and decompress the 3600-image test set on most Linux and Mac systems.
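
The 10x expansion by symbolic links can be pictured with a minimal Python sketch. This is not the contents of get_test_data.com. The function name `expand_with_symlinks`, the file name pattern `test_00001.img`, and the assumption of one image per degree (which the 3600-image total suggests) are all illustrative guesses.

```python
import os
import tempfile


def expand_with_symlinks(directory, n_real=360, repeats=10,
                         pattern="test_{:05d}.img"):
    """Create images n_real+1 .. n_real*repeats as symlinks to images 1..n_real.

    The naming pattern is hypothetical; a real data set has its own.
    Returns the number of links created.
    """
    created = 0
    for i in range(n_real + 1, n_real * repeats + 1):
        # Image i repeats image ((i - 1) % n_real) + 1 of the real sweep.
        source = pattern.format((i - 1) % n_real + 1)
        link = os.path.join(directory, pattern.format(i))
        if not os.path.lexists(link):
            # Relative target, so the directory can be moved (e.g. to /dev/shm).
            os.symlink(source, link)
            created += 1
    return created


if __name__ == "__main__":
    # Demo on empty placeholder files in a temporary directory.
    with tempfile.TemporaryDirectory() as d:
        for i in range(1, 361):
            open(os.path.join(d, "test_{:05d}.img".format(i)), "wb").close()
        made = expand_with_symlinks(d)
        print(made, "links,", len(os.listdir(d)), "files in total")
```

Only the 360 real files take up disk space; the links make the processing program see a 3600-image sweep.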

Whether you use my benchmark or not, the important thing about any benchmark is to run the exact same benchmark on as many machines as possible. Only then can you have enough "controls" to isolate what the important features are.

As for GHz, everything "should" scale as GHz unless something else is holding it back, like disk I/O or the network or a myriad of other things. The most important factor for data processing, however, has always been and still is the last-level CPU cache size. I think the reason multi-socket machines are faster than single-chip multi-core machines is because all the sockets are really just a way to get more cache into the box.

Data processing is kind of a different animal from most other crystallographic applications. Perhaps because of the pressure of modern detectors many image processing programs have now embraced the multi-CPU revolution. The problem, however, is that the more cores you have in your processor the lower the GHz will be for a single core. This seems to be a thermal management constraint. What that means is if you get a processor with lots and lots of cores you can process data really fast, but almost all the downstream steps like molecular replacement and refinement will be slower. In fact, even different phases of data processing benefit alternately from lots of cores vs lots of GHz, so ideally you'd like to have two machines: one with a lot of cores, and the other with a single really fast processor, and alternate between these two machines using scripts.

But if you can only get one machine, I currently recommend the Xeon W-2155 as a good general-purpose crystallography processor. It has 10 cores, so runs DIALS and other multi-CPU programs nicely up to this puzzling 8-10 core ceiling, but still has the GHz to run single-threaded programs nice and fast. It's not ridiculously expensive either.


My $1,440 worth,

-James Holton
MAD Scientist


On 12/2/2018 10:56 PM, graeme.win...@diamond.ac.uk wrote:

Re: publishing benchmarks - great idea - expand on what James described earlier.

Most programs are GHz dependent (for most “sensible” definitions of GHz; not
the mega-hyper-pipeline, stall-prone P4, say). However, I see your point that
labels such as “threaded” and “optimised for vector systems (e.g. AVX512)”
would be very useful.

I am certainly not advocating that computers > 3 years old should be thrown away
;-) I am one of those folks with a bad hoarding instinct: “it’s good for parts”
and “it still works fine” are both in my lexicon. If you are coot-ing and want
to refine a modest structure, probably most machines < 10 years old will be fine.

What I was trying to say is that your experience of how fast something is will
depend on your use case, and that the boffins in Santa Clara and Sunnyvale have
not been sitting on their hands this past decade.

Finally, processing “modern” data sets can be a challenge even on fairly hefty
machines - if you pull data04 from https://zenodo.org/record/1443110 you will
find a 3 minute data set [1] which (even with XDS and a script tweaked for
speed) can take a long while on a modern-ish machine. A 10-year-old Core 2 Duo
will not get this done in the same kind of time-frame.

best wishes Graeme

PS kudos to folks for sharing the data online

[1] which would make a fun challenge benchmark :-)

On 3 Dec 2018, at 02:16, Markus Heckmann
<markus.21...@gmail.com<mailto:markus.21...@gmail.com>> wrote:

Hi Graeme,

I suspect that this conclusion depends very closely on (i) the shape of the
problem and (ii) the extent to which the binary has been optimised for the
given platform.

I do hope some of this information is analyzed and either published or at least
put on the CCP4 wiki.

I am pretty sure that there are some applications (heavily threaded, making
extensive use of vector operations) which would be massively quicker on 2018
hardware than something a decade old. Certainly though, if you are comparing a
not-highly-optimised single threaded binary then your conclusion is probably a
valid one

I would really ask all the program developers (on the ccp4bb) to put a clear
table on their websites stating whether a given program is purely GHz dependent
and not multi-threaded.

Also how much power the machines take to get work done is a non-trivial factor…

But what about the environment? Trashing a decent machine from 2015 for the
latest Threadripper 2? These old machines have 80-90 + gold power supplies. Many
(like Apple's planned obsolescence) are *forcibly* destroyed, not refurbished at
all.

Does DIALS run that much quicker? How much time is saved for a PhD student in
their career if data processing speeds up from 15 min to 10 min?
Sure, perfect for use at the synchrotron, but otherwise?

Maybe the beamlines/synchrotrons should allow for remote data processing and
even refinement. Maybe all program devs need to publish benchmarks - that would
help users greatly.

These days I have a feeling science has copied the typical Electron/website
framework programmers: programs and websites getting fatter, not more
efficient, and hoping everyone has 128 GB RAM.

Markus

Cheerio Graeme

On 30 Nov 2018, at 19:32, James Holton
<0000270165b9f4cf-dmarc-requ...@jiscmail.ac.uk<mailto:0000270165b9f4cf-dmarc-requ...@jiscmail.ac.uk>>
wrote:

I have a dissenting opinion about computers "moving on a bit", at least when
it comes to most crystallography software.

Back in the late 20th century I defined some benchmarks for common crystallographic
programs with the aim of deciding which hardware to buy. By about 2003 the champion of
my refmac benchmark (https://bl831.als.lbl.gov/~jamesh/benchmarks/index.html#refmac) was
the new (at the time) AMD "Opteron" at 1.4 GHz. That ran in 74 seconds.

Last year, I bought a rather expensive 4-socket Intel Xeon E7-8870 v3 (turbos
to 3.0 GHz), which is the current champion of my XDS benchmark. The same old
refmac benchmark on this new machine, however, runs in 68.6 seconds. Only a
smidge faster than that old Opteron (which I threw away years ago).

The Xeon X5550 in consideration here takes 74.1 seconds to run this same refmac
benchmark, so price/performance wise I'd say that's not such a bad deal.

The fastest time I have for refmac to date is 41.4 seconds on a Xeon W-2155,
but if you scale by GHz you can see this is mostly due to its fast clock speed
(turbo to 4.5 GHz). With a few notable exceptions like XDS, HKL2k and shelx,
which are multi-processing and optimized to take advantage of the latest
processor features using Intel compilers, most crystallographic software is
either written in Python or compiled with gcc. In both these cases you end up
with performance pretty much scaling with GHz. And GHz is heat.
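
The "scale by GHz" comparison can be written out as a small Python sketch. The clock speeds and refmac times are the ones quoted in this message; the helper name `predict_runtime` and the assumption that runtime is inversely proportional to the turbo clock are illustrative, not a model anyone has validated.

```python
def predict_runtime(ref_seconds, ref_ghz, target_ghz):
    """Runtime expected on the target if speed scaled purely with clock rate."""
    return ref_seconds * ref_ghz / target_ghz


if __name__ == "__main__":
    # Figures quoted in this message: turbo GHz and measured refmac seconds.
    e7_ghz, e7_seconds = 3.0, 68.6   # Xeon E7-8870 v3
    w_ghz, w_seconds = 4.5, 41.4     # Xeon W-2155

    predicted = predict_runtime(e7_seconds, e7_ghz, w_ghz)
    print("predicted from clock alone: %.1f s" % predicted)
    print("measured:                   %.1f s" % w_seconds)
    print("measured/predicted:         %.2f" % (w_seconds / predicted))
```

Clock speed alone predicts about 45.7 s for the W-2155 against the measured 41.4 s, so most of the gain over the E7-8870 v3 is accounted for by GHz.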

Admittedly, the correlation is not perfect, and software has changed a wee bit
over the years, so comparisons across the decades are not exactly fair, but the
lesson I have learned from all my benchmarking is that single-core raw
performance has not changed much in the last ~10 years or so. Almost all the
speed increase we have seen has come from parallelization.

And one should not be too quick to dismiss clusters in favor of a single box
with a high core count. The latter can be held back by memory contention and
other hard-to-diagnose problems. Even with parallel execution many
crystallography programs don't get any faster beyond using about 8-10 cores.
Don't let 100% utilization fool you! Use a timer and you'll see. I'm not
really sure why that is, but it is the reason that same Xeon W-2155 that leads
my refmac benchmark is also my champion system for running DIALS and
phenix.refine.
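
"Use a timer" can be as simple as the following Python sketch, which measures wall-clock time rather than CPU utilization and turns a set of timings into speedup and parallel efficiency. The function names and the placeholder job are assumptions, and the timings in the demo are invented numbers, not measurements from any machine.

```python
import subprocess
import sys
import time


def wall_seconds(command):
    """Run a command to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def speedup_table(timings):
    """timings maps core count to wall-clock seconds.

    Returns rows of (cores, speedup vs the smallest core count, efficiency).
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for n in sorted(timings):
        speedup = base_time / timings[n]
        efficiency = speedup * base_cores / n
        rows.append((n, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Placeholder job; substitute the real processing command together
    # with that program's own option for the number of processes.
    elapsed = wall_seconds([sys.executable, "-c", "sum(range(10**6))"])
    print("placeholder job: %.2f s wall-clock" % elapsed)

    # INVENTED timings, only to show what a plateau looks like in the table.
    example = {1: 800.0, 4: 220.0, 8: 130.0, 16: 125.0}
    for n, s, e in speedup_table(example):
        print("%2d cores: speedup %.1fx, efficiency %.0f%%" % (n, s, 100 * e))
```

A machine can show 100% utilization on every core while the speedup column has stopped growing; the efficiency column makes that visible.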

My two cents,

-James Holton
MAD Scientist

On 11/26/2018 1:10 AM, V F wrote:

Dear all,
Thanks for all the off/list replies.

To be honest, how much are they paying you to take it? Can you sell it for
scrap?

Maybe I will give it a pass.

To compare, two dual CPU servers with Skylake Gold 6148 - that is 40 cores -
will probably beat the whole lot even if you could keep the cluster going.
And keeping clusters busy is a time consuming challenge... I know!
If they are 250W servers, then you are looking at £8000 per year to power
and cool it. The two modern servers will be more like £1500 per year to run.
And the servers will only cost about £6000... the economics and planet don't
stack up!
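
The arithmetic behind that comparison, as a minimal Python sketch. The three figures are the ones quoted above; the function name `payback_years` is made up for illustration, and the calculation ignores anything not mentioned in the thread (resale value, admin time, and so on).

```python
def payback_years(purchase_cost, old_running_per_year, new_running_per_year):
    """Years until the yearly running-cost saving has paid for the purchase."""
    saving = old_running_per_year - new_running_per_year
    if saving <= 0:
        return float("inf")  # the new hardware never pays for itself
    return purchase_cost / saving


if __name__ == "__main__":
    # Figures quoted above, in GBP: old cluster power and cooling per year,
    # two modern servers per year, and the purchase price of the servers.
    years = payback_years(6000.0, 8000.0, 1500.0)
    print("payback: %.2f years (about %.0f months)" % (years, years * 12))
```

On those figures the two new servers pay for themselves in a little under a year.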

By servers do you mean tower/standalone?

Thanks for the detailed explanation. From 2012, we already have many
Dell Precision T5600 with 2 x Xeon E5-2643 (8 cores, 16 threads) and
I was hoping parallelisation with clusters may be of some help. Looks
like not.

These are running so well (takes about 45 min for a typical dataset
reduction with DIALS) I am not sure buying new ones is useful.

########################################################################

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=CCP4BB&A=1
