Hello James.
Bravo and many thanks for this excellent overview of your activities. Of course the setup
(in your previous message) and the activities are very impressive by themselves.
But in addition, even though your message is not in itself a perl advocacy message, I feel
that it would be right at home in some perl/mod_perl advocacy forum, because it touches on
some general ideas which are valid /also/ for perl and mod_perl.
It was very refreshing to read for once a clear exposé of why it is still important
nowadays to think before programming, to program efficiently, and to choose the right tool
for the job at hand (be it perl, mod_perl, or any other) without the kind of off-the-cuff
a-priori generalities which tend to plague these discussions.
And even though our own (commercial) activities and setups do not have anything even close
to the scope which you describe, I would like to say that the same basic principles which
you mention in your exposé are just as valid when you scale down as when you scale up.
("--you can’t just throw memory, CPUs, power at a problem – you have to
think – how can I do what I need to do with the least resources..")
Even when you think of a single server, or a single server rack, at any one period in time
there is always a practical limit to how much memory or how many CPUs you can fit in a
given server, how many servers you can fit in a rack, or how many additional Gbit of
bandwidth you can allocate per server, beyond which there is a sudden "quantum jump" in how
practical and cost-effective a whole project becomes.
In that sense, I particularly enjoyed your examples of the database and of the additional
power line.
On 24.12.2020 02:38, James Smith wrote:
We don’t use perl for everything: yes, we use it for web data, and yes, we still use it as
the glue language in a lot of cases, but the most complex stuff is done with C (not even
C++, as that is too slow). Others on site use Python, Java, Rust, Go and PHP, and we are
also looking at using GPUs in cases where code can be highly parallelised.
It is not just one application – but many, many applications… All with a common goal of
understanding the human genome, and using it to develop new knowledge and techniques which
can advance health care.
We are a very large sequencing centre (one of the largest in the world) – what I was
pointing out is that you can’t just throw memory, CPUs and power at a problem – you have to
think: how can I do what I need to do with the least resources, rather than what resources
can I throw at the problem?
Currently we are acting as the central repository for all COVID-19 sequencing in the UK,
along with running one of the largest “wet” labs sequencing data for it – and that is half
the sequenced samples in the whole world. The UK is sequencing more COVID-19 genomes a day
than most other countries have sequenced since the start of the pandemic in Feb/Mar. This
has led to us discovering a new, more transmissible variant of the virus, and to knowing in
which parts of the country the different strains are present – no other country in the
world has the information, technology or infrastructure in place to achieve this.
But this is just a small part of the genomic sequencing we are looking at – we also:
* work on other pathogens – e.g. Plasmodium (malaria);
* sequence cancer genomes (and study how effective drugs are);
* are a major part of the Human Cell Atlas, which is looking at how the expression of genes
(in the simplest terms, which ones are switched on and switched off) differs between
tissues;
* sequence the genomes of other animals to understand their evolution;
* and look at some other species in detail, to see what we can learn from them when they
have defective genes.
All of these are currently scaled back, though, so that we can work relentlessly to support
the medical teams and other researchers in getting on top of COVID-19.
What is interesting is that many of the developers we have on campus (well, all working
from home at the moment) are (relatively) old, as we learnt to develop code on machines
with limited CPU and limited memory – so things had to be efficient, had to be compact… And
that is as important now as it was 20 or 30 years ago – the data we handle is growing
faster than Moore’s Law! Many of us take pride in doing things as efficiently as possible.
It took around 10 years to sequence and assemble the first human genome {well we are still
tinkering with it and filling in the gaps} – now at the institute we can sequence and
assemble around 400 human genomes in a day – to the same quality!
So most of our issues are due to the scale of the problems we face – e.g. the human genome
has 3 billion base pairs (A, C, G, Ts), so normal solutions don’t scale to that. Once, many
years ago, we looked at setting up an Oracle database with at least 1 row for every base
pair – recording all variants (think of them as spelling mistakes, for example a T rather
than an A, or an extra letter inserted or deleted) for that base pair… The schema was set
up – and then they realised it would take 12 months to load the data which we had then
(which is probably less than a millionth of what we have now)!
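
To see why, a rough back-of-envelope helps (the insert rate here is my own illustrative
assumption, not a figure from the thread):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One row per base pair of the reference genome alone,
    # before any per-sample variant rows are added.
    my $base_pairs      = 3_000_000_000;  # ~3 billion base pairs
    my $rows_per_second = 1_000;          # assumed sustained insert rate

    my $days = $base_pairs / $rows_per_second / (60 * 60 * 24);
    printf "Reference genome alone: ~%.0f days to load\n", $days;  # ~35 days

    # Multiply that by variant records across thousands of samples, plus
    # indexing, and a 12-month load estimate stops being surprising.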
Moving compute off site is a problem, as transferring the volume of data we have would
itself be a problem – you can’t easily move all the data to the compute – so you have to
bring the compute to the data.
The site I worked on before I became a more general developer was doing exactly that – and
the code that was written 12-15 years ago is actually still going strong. It has seen a few
changes over the years – many displays have had to be redeveloped, as the scale of the data
has grown so big that even the summary pages we produced 10 years ago now have to be
summarised themselves, because they are so large.
*From:* Mithun Bhattacharya <mit...@gmail.com>
*Sent:* 24 December 2020 00:06
*To:* mod_perl list <modperl@perl.apache.org>
*Subject:* Re: Confused about two development utils [EXT]
James, would you be able to share more info about your setup?
1. What exactly is your application doing which requires so much memory and CPU – is it
something like gene splicing? (No, I don't know much about it beyond Jurassic Park :D)
2. Do you feel Perl was the best choice for whatever you are doing, and if yes, then why?
How much of your stuff is using mod_perl, considering you mentioned not much is web related?
3. What are the challenges you are currently facing with your implementation?
On Wed, Dec 23, 2020 at 6:58 AM James Smith <j...@sanger.ac.uk> wrote:
Oh, but memory is a problem – though not if you have just a small cluster of machines! Our
boxes are larger than that – but they all run virtual machines {only a small proportion web
related}, and memory would rapidly become a problem in our data centre. We run VMware [995
hosts] and OpenStack [10,000s of hosts], plus a selection of large-memory machines
{measured in TBs of memory per machine}.
We would be looking at somewhere between 0.5 PB and 1 PB of memory – and it is not just the
price of buying that amount of memory: for many machines we need the fastest memory money
can buy for the workload. We would also need a lot more CPUs than we currently have, as we
would need far more machines to provide 64GB virtual machines {we would get 2 VMs per
host}. We currently have approx. 1-2,000 CPUs running our hardware (last time I had a
figure) – it would probably need to go to approximately 5-10,000!
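
As a quick sanity check on those numbers (my own arithmetic, using the figures above):

    # 0.5 PB of RAM carved into 64GB virtual machines:
    my $gb_of_ram = 500_000;          # 0.5 PB expressed in GB
    my $vms       = $gb_of_ram / 64;  # ~7,800 VMs (double that for 1 PB)
    my $hosts     = $vms / 2;         # ~3,900 hosts at 2 VMs per host
    # Several thousand extra hosts is how the CPU count jumps
    # from ~1-2,000 today towards ~5-10,000.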
It is not just the initial outlay, but the environmental and financial cost of running that
number of machines, and finding space to run them without putting the cooling costs through
the roof!! That is without considering what additional constraints on storage the extra
machines may impose (at the last count a year ago we had over 30 PBytes of storage on site
– and a large amount of offsite backup).
We would also stretch the amount of power we can get from the national grid to power it all
– we currently have 3 feeds from different parts of the national grid (we are fortunately
in a position where this is possible), and the dedicated link we would need to add more
power would be at least 50 miles long!
So – managing cores/memory is vitally important to us. Moving to the cloud is an option we
are looking at – but that is more than 4 times the price of our onsite set-up (with
substantial discounts from AWS), and would require an upgrade of our existing link to the
internet – which is currently 40 Gbit (I think).
Currently we are analysing very large amounts of data directly linked to the current major
world problem – this is why the UK is currently being isolated: we have discovered, and can
track, a new strain in near real time, and other countries have no ability to do this. In a
day we can and do handle, sequence and analyse more samples than the whole of France has
sequenced since February. We probably don’t have more of the new variant strain than other
areas of the world – it is just that we know we have it, because of the amount of
sequencing and analysis that we in the UK have done.
*From:* Matthias Peng <pengmatth...@gmail.com>
*Sent:* 23 December 2020 12:02
*To:* mod_perl list <modperl@perl.apache.org>
*Subject:* Re: Confused about two development utils [EXT]
Today memory is not a serious problem; each of our servers has 64GB of memory.
Forgot to add – so our FCGI servers need a lot (and I mean a lot) more memory than the
mod_perl servers to serve the same level of content (in case memory blows up with the FCGI
backends).
-----Original Message-----
From: James Smith <j...@sanger.ac.uk>
Sent: 23 December 2020 11:34
To: André Warnier (tomcat/perl) <a...@ice-sa.com>; modperl@perl.apache.org
Subject: RE: Confused about two development utils [EXT]
> This costs memory, and all the more since many perl modules are not thread-safe, so if
> you use them in your code, at this moment the only safe way to do it is to use the Apache
> httpd prefork model. This means that each Apache httpd child process has its own copy of
> the perl interpreter, which means that the memory used by this embedded perl interpreter
> has to be counted n times (as many times as there are Apache httpd child processes
> running at any one time).
This isn’t quite true – if you load modules before the process forks, then the children can
cleverly share the same parts of memory {this is the case on Linux anyway}. It is useful to
be able to "pre-load" core functionality which is used across all functions. It also speeds
up child process creation, as the modules are already in memory and compiled to byte code.
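
A minimal sketch of that pre-loading pattern (the module names here are placeholders, not
our actual stack):

    # startup.pl – loaded once by the parent httpd, via this line in httpd.conf:
    #   PerlRequire /etc/httpd/conf/startup.pl
    use strict;
    use warnings;

    # Heavy, commonly used modules are compiled in the parent process.
    # Children forked afterwards share these pages copy-on-write,
    # so the cost is paid once rather than per child.
    use DBI ();
    use Storable ();
    use POSIX ();

    1;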
One of the great advantages of mod_perl is Apache2::SizeLimit, which can blow away large
child processes – and then, if needed, create new ones. This is not the case with some of
the FCGI solutions: the individual processes can grow if there is a memory leak, or if a
request retrieves a large amount of content (even if it is not served), and perl can’t give
the memory back. So FCGI processes only get bigger and bigger, and eventually blow up
memory (or hit swap first).
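
For reference, wiring that up usually looks something like this (the thresholds are
illustrative, not our production values):

    # httpd.conf (prefork MPM)
    <Perl>
        use Apache2::SizeLimit;
        # sizes are in KB
        Apache2::SizeLimit->set_max_process_size(500_000);  # retire children over ~500 MB
        Apache2::SizeLimit->set_min_shared_size(50_000);    # ...or sharing too little
    </Perl>
    PerlCleanupHandler Apache2::SizeLimit

After each request the cleanup handler checks the child’s size, and any process that has
grown past the limits is retired, with a fresh child forked to replace it.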
--
The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered
in England with number 1021457 and a company registered in England with number 2742969,
whose registered office is 215 Euston Road, London, NW1 2BE.