Re: [ccp4bb] Another folding AI

2022-11-07 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Charlie
What you say is very disturbing. I can’t see how you can do science at all 
without having a model. It is most likely that the people you refer to have a 
model but don’t realise it. If this is the case, the model is likely to be a 
very poor one. A model should be a useful simplification of a complex system in 
which both the parameters of the model and the values assigned to these 
parameters are clear.

A good model of a 5G mobile phone mast should be able to predict the number of 
cancers it is likely to cause. However, politicians who oppose the construction 
of the phone mast (in order to gain votes) will argue that scientists can't 
prove they don't cause cancer. This happened in the UK (in Bath) and of course 
the politician was formally correct, but their model implied that one should 
never do anything.

For those still doubtful about what a model is, see, for example,
https://utw10426.utweb.utexas.edu/Topics/Models/Text.html
though of course there are many other sources.

Still all good fun to discuss.
Colin




From: Carter, Charlie 
Sent: 06 November 2022 21:19
To: Nave, Colin (DLSLtd,RAL,LSCI) 
Cc: CCP4BB@jiscmail.ac.uk
Subject: Re: [ccp4bb] Another folding AI

Colin,

A former graduate student of mine alerted me, in ~1984, to Box GEP, Hunter WG, 
Hunter JS. 1978. Statistics for Experimenters. New York: Wiley Interscience. I 
bought a new copy then and have very likely spent more time inside that book 
than any other over the years since. So I'm entirely in your court about 
Box's contributions.

On the other hand, as I diverged from crystallography into mechanistic 
enzymology and other areas, I began to realize that comments such as those you 
cited have also had the adverse effect of persuading people not to build models 
at all, and even of inducing real skepticism about building and testing models. 
I find that a shame.

Charlie



Re: [ccp4bb] Another folding AI

2022-11-07 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Hi Ian
Yes, I was suspicious about the Maxwell quote in creation.com as it seemed to 
be referring to molecules rather than life forms. Of course proteins are 
molecules too but the chemical mechanism of evolution in living systems was a 
mystery to both Darwin and Maxwell. Darwin felt he had to hide his lack of 
religious belief whereas Maxwell did embrace religion. Nothing wrong with that 
in my view. I always think Maxwell’s status as a scientist has been largely 
ignored by the general public in the UK. He introduced his electromagnetic 
theory mainly by talking about the work of others, finishing by saying that he 
did have an alternative theory.
Colin


From: CCP4 bulletin board  On Behalf Of Ian Tickle
Sent: 06 November 2022 20:43
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] Another folding AI


Hi Colin

That seems to be a fundamental misinterpretation of what Maxwell actually said 
(and being half Scot myself I feel obliged to defend him!).  To add one more 
quote to the mix, that of Charles Petzold in his review of the historical 
evidence for the claim of the creationists:

Abstract: Modern-day Christian creationists have claimed nineteenth-century 
Scottish physicist James Clerk Maxwell as one of their own. The author explores 
the historical evidence and discovers that Maxwell's attitude toward evolution 
was more nuanced than people assume. The characterization of Maxwell as an 
anti-Darwin creationist is based on exaggerations, misinterpretations, and 
(probably most of all) wishful thinking.

https://www.charlespetzold.com/etc/MaxwellMoleculesAndEvolution.html

Cheers

-- Ian



Re: [ccp4bb] Another folding AI

2022-11-06 Thread Nave, Colin (DLSLtd,RAL,LSCI)
All these quotes are great fun and worth keeping in mind, along with many 
others. I hadn't realised that the creationists had adopted James Clerk Maxwell 
as one of their own. However, he didn't go as far as one present-day scientist 
who believes the earth is no more than 10,000 years old. Which discipline?
Palaeontology of course. This person is not even some old fossil.

I always liked George Box’s comments about models. I remember Eleanor Dodson 
saying, probably at some CCP4 refinement study weekend, that a structure 
obtained by protein crystallography was like a curate’s egg. She might have 
predated George Box with this thought.

There should not be any doubt that AF2 models are useful though. The question 
is how far their usefulness extends.


From: CCP4 bulletin board  On Behalf Of Bryan Lepore
Sent: 05 November 2022 17:03
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] Another folding AI


And of course,



"... all models are approximations. Essentially, all models are wrong, but some 
are useful. However, the approximate nature of the model must always be borne 
in mind"



Box, G. E. P.; Draper, N. R. (1987). Empirical Model-Building and Response 
Surfaces. John Wiley & Sons.



"Since all models are wrong the scientist must be alert to what is importantly 
wrong. It is inappropriate to be concerned about mice when there are tigers 
abroad."



Box, George E. P. (1976). "Science and statistics". Journal of the American 
Statistical Association 71 (356): 791-799. doi:10.1080/01621459.1976.10480949



https://en.m.wikipedia.org/wiki/All_models_are_wrong





Re: [ccp4bb] Another folding AI

2022-11-05 Thread Nave, Colin (DLSLtd,RAL,LSCI)
There is another quote from an equally revered physicist – James Clerk Maxwell

“The actual science of logic is conversant at present only with things either 
certain, impossible, or entirely doubtful, none of which (fortunately) we have 
to reason on. Therefore the true logic for this world is the calculus of 
Probabilities, which takes account of the magnitude of the probability which 
is, or ought to be, in a reasonable man's mind.”

I will leave people to decide for themselves which of these quotes is most 
relevant to the various AI approaches to protein folding, other than to say 
that at least one of them covers the uncertainty in any structure predictions.

From: CCP4 bulletin board  On Behalf Of Bryan Lepore
Sent: 03 November 2022 12:22
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] Another folding AI


[ emphasis/bold font is mine]:



“In general we look for a new law by the following process. First we guess it. 
Then we compute the consequences of the guess to see what would be implied if 
this law that we guessed is right. Then we compare the result of the 
computation to nature, with experiment or experience, compare it directly with 
observation, to see if it works. If it disagrees with experiment it is wrong. 
In that simple statement is the key to science. It does not make any difference 
how beautiful your guess is. It does not make any difference how smart you are, 
who made the guess, or what his name is – if it disagrees with experiment it is 
wrong. That is all there is to it.”



-Richard Feynman

The Character of Physical Law (1965)
Chapter 7, “Seeking New Laws”, p.150 (Modern Library edition, 1994)

ISBN 0-679-60127-9





Re: [ccp4bb] am I doing this right?

2021-10-21 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Congratulations to James for starting this interesting discussion.

For those who are, like me, nowhere near a black belt in statistics, the thread 
has included a number of distributions. I have had to look up where these 
apply and investigate their properties.
As an example,
“The Poisson distribution is used to model the # of events in the future, 
Exponential distribution is used to predict the wait time until the very first 
event, and Gamma distribution is used to predict the wait time until the k-th 
event.”
A useful calculator for distributions can be found at
https://keisan.casio.com/menu/system/0540
a specific example is at
https://keisan.casio.com/exec/system/1180573179
where cumulative probabilities for a Poisson distribution can be found given 
values for x and lambda.
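
For anyone who prefers to check such numbers in code rather than on the 
website, a minimal sketch using scipy.stats (lam, x, k and t below are 
illustrative values, not numbers from the thread):

# Poisson / Exponential / Gamma wait-time relationships, via scipy.
# Assumes a constant event rate lam per unit interval (or per pixel).
from scipy import stats

lam, x = 2.5, 4          # illustrative rate and observed count
print(stats.poisson.cdf(x, mu=lam))            # P(K <= x), as on the calculator

t, k = 1.0, 3
print(stats.expon.cdf(t, scale=1.0/lam))       # P(first event within time t)
print(stats.gamma.cdf(t, a=k, scale=1.0/lam))  # P(k-th event within time t)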

The most appropriate prior is another issue which has come up, e.g. is a flat 
prior appropriate? I can see that a different prior would be appropriate for 
different areas of the detector (e.g. 1 pixel instead of 100 pixels) but the 
most appropriate prior seems a bit arbitrary to me. One of James' examples was 
10^5 background photons distributed among 10^6 pixels – what is the most 
appropriate prior for this case? I presume it is OK to update the prior after 
each observation but I understand that it can create difficulties if not done 
properly.
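
As an aside, a minimal sketch of what updating "properly" can look like with 
the conjugate gamma prior that comes up later in the thread (the starting 
alpha, beta and the toy counts are illustrative; the 10^5/10^6 numbers are 
from James' example):

# Gamma(alpha, beta) prior on the per-pixel rate (beta = rate parameter).
# After observing counts k1..kn, the posterior is
# Gamma(alpha + sum(k), beta + n), so updating after every observation
# and updating once with all the data give the identical answer.
alpha, beta = 1.0, 1.0                      # illustrative starting prior

a, b = alpha, beta
for k in [0, 0, 1, 0]:                      # toy patch, updated sequentially
    a, b = a + k, b + 1                     # same result as the batch formula

n_pixels, n_photons = 10**6, 10**5          # James' example, in one batch
a_post, b_post = alpha + n_photons, beta + n_pixels
print(a_post / b_post, a_post / b_post**2)  # posterior mean ~0.1, tiny variance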

Being able to select the prior is sometimes seen as a strength of Bayesian 
methods. However, as a strong advocate of Bayesian methods once put it, this is 
a bit like Achilles boasting about his heel!

I hope for some agreement among the black belts. It would be good to end up 
with some clarity about the most appropriate probability distributions and 
priors. Also, have we got clarity about the question being asked?

Thanks to all for the interesting points.

Colin
From: CCP4 bulletin board  On Behalf Of Randy John Read
Sent: 21 October 2021 13:23
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] am I doing this right?

Hi Kay,

No, I still think the answer should come out the same if you have good reason 
to believe that all 100 pixels are equally likely to receive a photon (for 
instance because your knowledge of the geometry of the source and the detector 
says the difference in their positions is insignificant, i.e. part of your 
prior expectation). Unless the exact position of the spot where you detect the 
photon is relevant, detecting 1 photon on a big pixel and detecting the same 
photon on 1 of 100 smaller pixels covering the same area are equivalent events. 
What should be different in the analysis, if you're thinking about individual 
pixels, is that the expected value for a photon landing on any of the pixels 
will be 100 times lower for each of the smaller pixels than the single big 
pixel, so that the expected value of their sum is the same. You won't get to 
that conclusion without having a different prior probability for the two cases 
that reflects the 100-fold lower flux through the smaller area, regardless of 
the total power of the source.

Best wishes,
Randy
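
A quick numeric check of the equivalence described above, under the 
shared-rate reading of the prior (a sketch with illustrative numbers, not a 
claim about which prior structure is the right one):

# Patch total after observing all zeroes, with a flat prior on the rate.
# Big pixel: rate L for the whole patch, k=0 -> posterior Gamma(1, 1).
# 100 small pixels sharing one rate m = L/100, 100 zeroes observed
#   -> posterior on m is Gamma(1, 100), and the patch total is 100*m.
a = 1.0                                # shape = k_total + 1 with a flat prior
E_big, V_big = a / 1.0, a / 1.0**2     # Gamma(1, rate=1): mean 1, var 1
E_small = 100 * (a / 100.0)            # E[100*m], m ~ Gamma(1, rate=100)
V_small = 100**2 * (a / 100.0**2)      # variance scales by 100^2
print(E_big, V_big, E_small, V_small)  # 1.0 1.0 1.0 1.0 - the same
# Independent flat priors on each pixel instead give E = V = 100 for the
# patch (k+1 per pixel, summed); the prior structure is doing all the work.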


On 21 Oct 2021, at 13:03, Kay Diederichs <kay.diederi...@uni-konstanz.de> wrote:

Randy,

I must admit that I am not certain about my answer, but I lean toward thinking 
that the result (of the two thought experiments that you describe) is not the 
same. I do agree that it makes sense that the expectation value is the same, 
and the math that I sketched in 
https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=CCP4BB;bdd31b04.2110 actually 
shows this. But the variance? To me, a 100-pixel patch with all zeroes is no 
different from sequentially observing 100 pixels, one after the other. For the 
first of these pixels, I have no idea what the count is, until I observe it. 
For the second, I am less surprised that it is 0 because I observed 0 for the 
first. And so on, until the 100th. For the last one, my belief that I will 
observe a zero before I read out the pixel is much higher than for the first 
pixel. The variance is just the inverse of the amount of error (squared) that 
we assign to our belief in the expectation value. And that amount of belief is 
very different. I find it satisfactory that the sigma goes down with the sqrt() 
of the number of pixels.

Also, I don't find an error in the math of my posting of Mon, 18 Oct 2021 
15:00:42 +0100. I do think that a uniform prior is not realistic, but this 
does not seem to make much difference for the 100-pixel thought experiment.

We could change the thought experiment in the following way - you observe 99 
pixels with zero counts, and 1 with 1 count. Would you still say that both the 
big-pixel-single-observation and the 100-pixel experiment should give 
expectation value of 2 and variance of 2? I wouldn't.

Best wishes,
Kay

On Thu, 21 Oct 2021 09:00:23 +, Randy John Read <rj...@cam.ac.uk> wrote:


Just to be a bit clearer, I mean that the calculation of the expected value and 
its variance should give the same answer 

Re: [ccp4bb] am I doing this right?

2021-10-17 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Hi James
For the case under consideration,  isn't the gamma distribution the maximum 
entropy prior i.e. a default with minimum information content. 
Colin
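
A minimal sketch of the conjugacy in question, checking the gamma-in, 
gamma-out property against a brute-force grid posterior (alpha, beta and k 
are illustrative; whether the gamma is also the maximum entropy prior depends 
on the constraints - it is maximum entropy on (0, inf) for fixed E[x] and 
E[log x]):

# Gamma prior x Poisson likelihood -> Gamma posterior, checked numerically.
import numpy as np
from scipy import stats

alpha, beta, k = 2.0, 1.0, 3                 # illustrative prior and count
lam = np.linspace(1e-6, 20.0, 20001)         # grid over the rate
post = stats.gamma.pdf(lam, a=alpha, scale=1/beta) * stats.poisson.pmf(k, lam)
post /= post.sum() * (lam[1] - lam[0])       # normalise on the grid

conj = stats.gamma.pdf(lam, a=alpha + k, scale=1/(beta + 1.0))
print(np.abs(post - conj).max())             # ~0: the gamma form is preserved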

-Original Message-
From: CCP4 bulletin board  On Behalf Of James Holton
Sent: 17 October 2021 18:25
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] am I doing this right?

Thank you Gergely.  That is interesting!

I don't mind at all making this Bayesian, as long as it works!

Something I'm not quite sure about: does the prior distribution HAVE to be a 
gamma distribution? Not that that really narrows things down, since there are an 
infinite number of them, but is that really the "I have no idea" prior? Or just 
a convenient closed-form choice? I've only just recently heard of conjugate 
priors.

Much appreciate any thoughts you may have on this,

-James


On 10/16/2021 3:48 PM, Gergely Katona wrote:
> Dear James,
>
> If I understand correctly you are looking for a single rate parameter to 
> describe the pixels in a block. It would also be possible to estimate the 
> rates for individual pixels or estimate the thickness of the sample from the 
> counts if you have a good model, that is where Bayesian methods really shine. 
> I tested the simplest first Bayesian network with 10 and 100 zero count 
> pixels, respectively:
>
> https://colab.research.google.com/drive/1TGJx2YT9I-qyOT1D9_HCC7G7as1KXg2e?usp=sharing
>
>
> The two posterior distributions are markedly different even if they start 
> from the same prior distribution, which I find more intuitive than the 
> frequentist treatment of uncertainty. You can test different parameters for 
> the gamma prior or change to another prior distribution. It is possible to 
> reduce the posterior distributions to their mean or posterior maximum, if 
> needed. If you are looking for an alternative to the Bayesian perspective 
> then this will not help, unfortunately.
>
> Best wishes,
>
> Gergely
>
> -Original Message-
> From: CCP4 bulletin board  On Behalf Of James 
> Holton
> Sent: den 16 oktober 2021 21:01
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] am I doing this right?
>
> Thank you everyone for your thoughtful and thought-provoking responses!
>
> But, I am starting to think I was not as clear as I could have been about my 
> question.  I am actually concerning myself with background, not necessarily 
> Bragg peaks.  With Bragg photons you want the sum, but for background you 
> want the average.
>
> What I'm getting at is: how does one properly weight a zero-photon 
> observation when it comes time to combine it with others?  Hopefully they are 
> not all zero.  If they are, check your shutter.
>
> So, ignoring Bragg photons for the moment (let us suppose it is a systematic 
> absence) what I am asking is: what is the variance, or, better yet, what is 
> the WEIGHT one should assign to the observation of zero photons in a patch of 
> 10x10 pixels?
>
> In the absence of any prior knowledge this is a difficult question, but a 
> question we kind of need to answer if we want to properly measure data from 
> weak images.  So, what do we do?
>
> Well, with the "I have no idea" uniform prior, it would seem that expectation 
> (Epix) and variance (Vpix) would be k+1 = 1 for each pixel, and therefore the 
> sum of Epix and Vpix over the 100 independent pixels is:
>
> Epatch=Vpatch=100 photons
>
> I know that seems weird to assume 100 photons should have hit when we 
> actually saw none, but consider what that zero-photon count, all by itself, 
> is really telling you:
> a) Epix > 20 ? No way. That is "right out". Given we know its Poisson 
> distributed, and that background is flat, it is VERY unlikely you have E that 
> big when you saw zero. Cross all those E values off your list.
> b) Epix=0 ? Well, that CAN be true, but other things are possible and all of 
> them are E>0. So, most likely E is not 0, but at least a little bit higher.
> c) Epix=1e-6 ?  Yeah, sure, why not?
> d) Epix= -1e-6 ?  No. Don't be silly.
> e) If I had to guess? Meh. 1 photon per pixel?  That would be k+1
>
> I suppose my objection to E=V=0 is because V=0 implies infinite confidence in 
> the value of E, and that we don't have. Yes, it is true that we are quite 
> confident in the fact that we did not see any photons this time, but the 
> remember that E and V are the mean and variance that you would see if you did 
> a million experiments under the same conditions. We are trying to guess those 
> from what we've got. Just because you've seen zero a hundred times doesn't 
> mean the 101st experiment won't give you a count.  If it does, then maybe 
> Epatch=0.01 and Epix=0.0001?  But what do you do before you see your first 
> photon?
> All you can really do is bracket it.
>
> But what if you come up with a better prior than "I have no idea" ?
> Well, we do have other pixels on the detector, and presuming the background 
> is flat, or at least smooth, maybe the average counts/pixel is a better prior?
>
> So, let 

Re: [ccp4bb] AI papers in experimental macromolecular structure determination

2021-08-04 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Bernhard
What qualifies? Good question. 
There are plenty of books on AI/machine learning but, as always, it is more 
efficient/lazier to read reviews than the books themselves. I think the London 
Review of Books allows limited access to its articles so most should be able to 
read this
https://www.lrb.co.uk/the-paper/v43/n02/paul-taylor/insanely-complicated-hopelessly-inadequate?referrer=https%3A%2F%2Fwww.google.com%2F
It might be interesting (though perhaps not useful) to classify the examples 
for macromolecular structure determination into categories such as GOFAI etc. 
However, this particular term is rather pejorative as it would mean describing 
the developers as old-fashioned!

Colin




-Original Message-
From: CCP4 bulletin board  On Behalf Of Bernhard Rupp
Sent: 03 August 2021 21:00
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] AI papers in experimental macromolecular structure 
determination

Maybe we should get to the root of this - what qualifies as machine learning 
and what not?

Do nonparametric predictors such as KDE qualify?

https://www.ruppweb.org/mattprob/default.html

Happy to add to the confusion.

-Original Message-
From: CCP4 bulletin board  On Behalf Of Tim Gruene
Sent: Tuesday, August 3, 2021 11:59
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] AI papers in experimental macromolecular structure 
determination

Hello Andrea,

profile fitting, as it is done in mosflm
(https://doi.org/10.1107/S090744499900846X) or evalccd, or ... probably also 
qualifies as AI/machine learning.

Best wishes,
Tim

On Tue, 3 Aug 2021 11:43:06 +
"Thorn, Dr. Andrea"  wrote:

> Dear colleagues,
> I have compiled a list of papers that cover the application of 
> AI/machine learning methods in single-crystal structure determination 
> (mostly macromolecular crystallography) and single-particle Cryo-EM.
> The draft list is attached below.
> 
> If I missed any papers, please let me know. I will send the final list 
> back here, for the benefit of all who are interested in the topic.
> 
> Best wishes,
> 
> 
> Andrea.
> 
> 
> __
> General:
> - Gopalakrishnan, V., Livingston, G., Hennessy, D., Buchanan, B. & 
> Rosenberg, J. M. (2004). Acta Cryst D. 60, 1705–1716.
> - Morris, R. J. (2004). Acta Cryst D. 60, 2133–2143.
> 
> Micrograph preparation:
> - (2020). Journal of Structural Biology. 210, 107498.
> 
> Particle Picking:
> - Sanchez-Garcia, R., Segura, J., Maluenda, D., Carazo, J. M. & 
> Sorzano, C. O. S. (2018). IUCrJ. 5, 854–865.
> - Al-Azzawi, A., Ouadou, A., Tanner, J. J. & Cheng, J. (2019). BMC 
> Bioinformatics. 20, 1–26.
> - George, B., Assaiya, A., Roy, R. J., Kembhavi, A., Chauhan, R., 
> Paul, G., Kumar, J. & Philip, N. S. (2021). Commun Biol. 4, 1–12.
> - Lata, K. R., Penczek, P. & Frank, J. (1995). Ultramicroscopy. 58, 
> 381–391.
> - Nguyen, N. P., Ersoy, I., Gotberg, J., Bunyak, F. & White, T. A.
> (2021). BMC Bioinformatics. 22, 1–28.
> - Wang, F., Gong, H., Liu, G., Li, M., Yan, C., Xia, T., Li, X. & 
> Zeng, J. (2016). Journal of Structural Biology. 195, 325–336.
> - Wong, H. C., Chen, J., Mouche, F., Rouiller, I. & Bern, M. (2004).
> Journal of Structural Biology. 145, 157–167.
> 
> Motion description in Cryo-EM:
> - Matsumoto, S., Ishida, S., Araki, M., Kato, T., Terayama, K. & 
> Okuno, Y. (2021). Nat Mach Intell. 3, 153–160.
> - Zhong, E. D., Bepler, T., Berger, B. & Davis, J. H. (2021). Nat 
> Methods. 18, 176–185.
> 
> Local resolution:
> - Avramov, T. K., Vyenielo, D., Gomez-Blanco, J., Adinarayanan, S., 
> Vargas, J. & Si, D. (2019). Molecules. 24, 1181.
> - Ramírez-Aportela, E., Mota, J., Conesa, P., Carazo, J. M. & Sorzano, 
> C. O. S. (2019). IUCrJ. 6, 1054–1063.
> - (2021). QAEmap: A Novel Local Quality Assessment Method for Protein 
> Crystal Structures Using Machine Learning.
> 
> Map post-processing:
> - Sanchez-Garcia, R., Gomez-Blanco, J., Cuervo, A., Carazo, J. M., 
> Sorzano, C. O. S. & Vargas, J. (2020). BioRxiv. 2020.06.12.148296.
> 
> Secondary structure assignment in map:
> - Subramaniya, S. R. M. V., Terashi, G. & Kihara, D. (2019). Nat 
> Methods. 16, 911–917.
> - Li, R., Si, D., Zeng, T., Ji, S. & He, J. (2016). 2016 IEEE 
> International Conference on Bioinformatics and Biomedicine (BIBM), 
> Vol. pp. 41–46.
> - Si, D., Ji, S., Nasr, K. A. & He, J. (2012). Biopolymers. 97, 
> 698–708.
> - He, J. & Huang, S.-Y. Brief Bioinform.
> - Lyu, Z., Wang, Z., Luo, F., Shuai, J. & Huang, Y. (2021). Frontiers 
> in Bioengineering and Biotechnology. 9,.
> - Mostosi, P., Schindelin, H., Kollmannsberger, P. & Thorn, A.
> (2020). Angewandte Chemie International Edition.
> 
> Automatic structure building:
> - Alnabati, E. & Kihara, D. (2020). Molecules. 25, 82.
> - Si, D., Moritz, S. A., Pfab, J., Hou, J., Cao, R., Wang, L., Wu, T.
> & Cheng, J. (2020). Sci Rep. 10, 1–22.
> - Moritz, S. A., Pfab, J., Wu, T., Hou, J., Cheng, J., Cao, R., Wang, 
> L. & Si, D. (2019).
> - Chojnowski, G., Pereira, J. & Lamzin, V. S. (2019). Acta Cryst D.
> 75, 753–763.
> 
> 

Re: [ccp4bb] Can twinning be seen in the diffraction pattern?

2021-03-14 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Dear Gerard, Marina and others.
I agree that using a multi-axis goniostat is one way of separating the spots. 
Also that the separation of spots provided by the spatial resolution of the 
detector is often much better than the separation you could get from even very 
fine-sliced images. However, if both the beam divergence and fine phi slicing 
range are small, then the spots should be resolved independently of the 
direction of the large unit cell axis. Many beamlines are targeted to small 
focal spots to examine small crystals. The downside is often that the beam 
divergence is large, at least in the horizontal direction. Reducing the beam 
divergence by means of slits will lead to a reduction in flux. However, the 
spot to background ratio should be better with the more parallel beam and fine 
phi slicing option. Orienting the crystal appropriately with a multi-axis 
goniostat will reduce the number of images required but this should not be an 
issue with modern detectors. 

It would be interesting to find out the relevant parameters in this case. 
Detector - pixel size, detector distance
Beam - wavelength, divergence and beam size at crystal (both horizontal and 
vertical)
Data collection - rotation range per image, crystal orientation
Crystal - unit cell parameters
Plus of course any which I have forgotten!
Best regards
Colin
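
For anyone wanting to plug the numbers in once they are known, a 
back-of-envelope sketch (small-angle approximations; every value below is an 
illustrative stand-in for the parameters listed above):

# Are neighbouring spots along a long axis resolved at the detector?
# Adjacent reflections along an axis of length a are ~lambda/a radians
# apart, i.e. ~D*lambda/a mm apart at detector distance D.
wavelength = 1.0       # Angstrom (illustrative)
a_axis = 500.0         # Angstrom, long unit-cell axis (illustrative)
D = 400.0              # mm, detector distance (illustrative)
pixel = 0.172          # mm, detector pixel size (illustrative)
divergence = 1.0e-3    # rad, horizontal beam divergence (illustrative)

spot_sep = D * wavelength / a_axis    # mm between neighbouring spots
smear = D * divergence                # mm of broadening from divergence
print(f"separation {spot_sep:.3f} mm vs pixel {pixel} mm "
      f"plus divergence smear {smear:.3f} mm")
# Roughly resolved when spot_sep exceeds a few pixels plus the smear;
# aligning the long axis with the rotation axis relaxes the requirement.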

-Original Message-
From: CCP4 bulletin board  On Behalf Of Gerard Bricogne
Sent: 12 March 2021 18:05
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] Can twinning be seen in the diffraction pattern?

Dear Marina,

 Mark seems to have hit the nail on the head. The left-hand picture of your 
second jpg shows that you have an axis, at about 45 degrees from the horizontal 
(and hence from the rotation axis), along which the spots are very close. These 
spots seem to be only just separated on that picture, when that direction is 
parallel to the detector plane, but for values of the rotation angle away from 
that particular one, they are likely to collide or even overlap within the same 
diffraction image, particularly if your image width is not especially small.

 When you have a crystal with such a long axis in real space, and therefore 
closely spaced spots in reciprocal space, you need to orient it in such a way 
that this axis be essentially aligned with the rotation axis, and to put the 
detector far enough that the close spots in that direction be resolved. The 
idea is that the angular separation of spots provided by the spatial resolution 
of the detector is much better than the separation you could get from even very 
fine-sliced images.

 To orient your crystals in this way you will benefit from using a beamline 
equipped with a multi-axis goniostat. The dataset you will collect with such an 
aligned axis will have a "cusp" of missing reflections around that axis, but 
that is a better outcome than the alternative of having the zillions of 
"angular overlaps" between distinct reflections that would be produced with a 
long-axis orientation such as that shown in your pictures.
That random overlap will produce the same effect on intensity statistics as 
twinning, but there would not be a consistent twinning fraction among all your 
measurements that you could then put into a refinement against them.

 Do your unit-cell parameters confirm this scenario?


 With best wishes,

  Gerard.

--
On Fri, Mar 12, 2021 at 06:29:06PM +0100, Mark J van Raaij wrote:
> Hi Marina,
> The close-together spots in the zoom inset of your figure I think are not 
> split spots, but separate reflections. They are close together because you 
> appear to have a unit cell with one axis much longer than the other two (we 
> work on elongated proteins, so we have some experience with that). The same 
> short distances are also clearly visible in the picture on the bottom left.
> If you index the image in MOSFLM and it puts boxes around both I'd conclude 
> they are separate reflections, but there are perhaps more sophisticated ways 
> to verify this.
> So I think it's true that in this case the twinning is not (obviously) visible 
> in the diffraction pattern - but detected through intensity statistics later.
> Best wishes,
> Mark
> 
> Mark J van Raaij
> Dpto de Estructura de Macromoleculas
Centro Nacional de Biotecnologia - CSIC, calle Darwin 3
> E-28049 Madrid, Spain
> 
> 
> 
> > On 12 Mar 2021, at 11:30, Marina Gárdonyi 
> >  wrote:
> > 
> > 
> > Hello everyone,
> > 
> > I am a PhD student at the Philipps-University in Marburg and I am currently 
> > writing my thesis.
> > 
> > I have problems to understand whether in my case twinning can be seen in 
> > the diffraction pattern or not.
> > 
> > I know that it depends on the type of twinning wheter you can see it in the 
> > diffraction pattern. The crystal had a resolution of 2.2 A. During 
> > processing it seemed to have the space group P622, but in the end it was 
> > P3(2)21. With phenix Xtriage I found out, that 

Re: [ccp4bb] seek your opion on this weird diffractio pattern

2020-12-08 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Hi Joseph
Great diffraction patterns. As Jon has just said, it could be zinc acetate. 
There are some very strong spots which would be consistent with a single 
crystal of this. However, something else seems to be present.
Could it instead be PEG crystals with some superlattice repeat? For unit cell 
dimensions of PEG see
https://pubs.acs.org/doi/abs/10.1021/ma60035a005
Some circles giving Bragg spacings would help. Also the rotation range. You 
should at least be able to estimate some of the cell dimensions and see if they 
are consistent with PEG, zinc acetate or some alternative. 
I seem to recall the subject of PEG crystals coming up before.
Colin
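
In case it helps with putting circles on the images, a small sketch for 
converting a ring radius on the detector into a Bragg spacing (assuming a 
flat detector normal to the beam; the numbers are only illustrative):

import math

def bragg_spacing(r_mm, dist_mm, wavelength_A):
    # two_theta from the ring radius, then Bragg: d = lambda / (2 sin(theta))
    two_theta = math.atan2(r_mm, dist_mm)
    return wavelength_A / (2.0 * math.sin(two_theta / 2.0))

print(bragg_spacing(100.0, 250.0, 1.0))   # ~2.6 A for these made-up values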

-Original Message-
From: CCP4 bulletin board  On Behalf Of Joseph Ho
Sent: 08 December 2020 15:47
To: CCP4BB@JISCMAIL.AC.UK
Subject: [ccp4bb] seek your opion on this weird diffractio pattern

Dear all:
We recently worked on an 8 kDa protein crystal structure. We obtained crystals 
in a zinc acetate and PEG 8000 condition. However, we observed these unusual 
diffraction patterns. I am wondering if anyone has observed this and knows how 
it can occur. The cryoprotectant is glycerol.

Thank you for your help

Joseph





Re: [ccp4bb] External: Re: [ccp4bb] AlphaFold: more thinking and less pipetting (?)

2020-12-04 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Michel
Yes, a good point, relevant to the difference between AlphaGo and AlphaFold2. 
My understanding is that AlphaGo did begin with information about previous 
games but, after this, it played against itself and became significantly 
better. AlphaFold2 relied perhaps completely on knowledge of previous "games" 
but didn't have an opponent to play against.

There is a difference between the intrinsic nature of the folding problem and 
the successful implementation, using additional information, of AlphaFold2. I 
was really asking about the intrinsic nature of the folding problem (and Chess, 
Go) but, in practice, the question is probably not particularly relevant.

It might be true, for single isolated proteins that "all the information 
required for the 3D structure is in the sequence." However, many proteins can 
and do form amyloids. I think it was Chris Dobson who pointed out that most 
sequences would form amyloids and only a small number of sequences, tuned by 
natural selection, would form useful folds. Even these could easily revert to 
amyloids (otherwise known as the precipitate in the crystallisation well). 
Chaperones get involved and there is the issue of kinetic rather than 
thermodynamic control. See also James Holton's comments about energy 
minimisation. All this just indicates that the problem would be very hard 
without known structures. However, the advantage for predicting structure from 
sequence is that one can assume that the vast majority of sequences people are 
interested in will fold in to something useful, rather than an amyloid. Of 
course spider silk forms amyloid fibres and they are structurally useful.

All interesting issues
  Colin


From: CCP4 bulletin board  On Behalf Of Michel Fodje
Sent: 04 December 2020 15:58
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] External: Re: [ccp4bb] AlphaFold: more thinking and less 
pipetting (?)

I think the results from AlphaFold2, although exciting and a breakthrough are 
being exaggerated just a bit.  We know that all the information required for 
the 3D structure is in the sequence. The protein folding problem is simply how 
to go from a sequence to the 3D structure. This is not a complex problem in the 
sense that cells solve it deterministically.  Thus the problem is due to lack 
of understanding and not due to complexity.  AlphaFold and all the others 
trying to solve this problem are "cheating" in that they are not just using the 
sequence, they are using other sequences like it (multiple-sequence 
alignments), and they are using all the structural information contained in the 
PDB.  All of this information is not used by the cells.   In short, unless 
AlphaFold2 now allows us to understand how exactly a single protein sequence 
produces a particular 3D structure, the protein folding problem is hardly 
solved in a theoretical sense. The only reason we know how well AlphaFold2 did 
is because the structures were solved and we could compare with the 
predictions, which means verification is lacking.

The protein folding problem will be solved when we understand how to go from a 
sequence to a structure, and can verify a given structure to be correct without 
experimental data. Even if AlphaFold2 got 99% of structures right, your next 
interesting target protein might be the 1%. How would you know?   Until then, 
what AlphaFold2 is telling us right now is that all (most) of the information 
present in the sequence that determines the 3D structure can be gleaned in bits 
and pieces scattered between homologous sequences, multiple-sequence 
alignments, and other protein 3D structures in the PDB.  Deep Learning allows a 
huge amount of data to be thrown at a problem and the back-propagation of the 
networks then allows careful fine-tuning of weights which determine how 
relevant different pieces of information are to the prediction.  The networks 
used here are humongous and a detailed look at the weights (if at all feasible) 
may point us in the right direction.



Re: [ccp4bb] AlphaFold: more thinking and less pipetting (?)

2020-12-04 Thread Nave, Colin (DLSLtd,RAL,LSCI)
The subject line for Isabel's email is very good.

I do have a question (more a request) for the more computer scientist oriented 
people. I think it is relevant for where this technology will be going. It 
comes from trying to understand whether problems addressed by Alpha are NP, NP 
hard, NP complete etc. My understanding is that the previous successes of Alpha 
were for complete information games such as Chess and Go. Both the rules and 
the present position were available to both sides. The folding problem might be 
in a different category. It would be nice if someone could explain the 
difference (if any) between Go and the protein folding problem perhaps using 
the NP type categories.

Colin



From: CCP4 bulletin board  On Behalf Of Isabel 
Garcia-Saez
Sent: 03 December 2020 11:18
To: CCP4BB@JISCMAIL.AC.UK
Subject: [ccp4bb] AlphaFold: more thinking and less pipetting (?)

Dear all,

Just commenting that, after the stunning performance of AlphaFold, which uses 
AI from Google, maybe some of us could dedicate ourselves to the noble art of 
gardening, baking, doing Chinese calligraphy, watching the clouds pass, or all 
of these together (just in case, I have already prepared my subscription to 
Netflix).

https://www.nature.com/articles/d41586-020-03348-4

Well, I suppose that we still have the structures of complexes (at the moment). 
I am wondering how labs will have access to this technology in the future 
(would it be free, coming from the company DeepMind/Google?). It seems that 
they have already published some code. Well, exciting times.

Cheers,

Isabel


Isabel Garcia-Saez  PhD
Institut de Biologie Structurale
Viral Infection and Cancer Group (VIC)-Cell Division Team
71, Avenue des Martyrs
CS 10090
38044 Grenoble Cedex 9
France
Tel.: 00 33 (0) 457 42 86 15
e-mail: isabel.gar...@ibs.fr
FAX: 00 33 (0) 476 50 18 90
http://www.ibs.fr/






Re: [ccp4bb] [3dem] Which resolution?

2020-03-12 Thread Nave, Colin (DLSLtd,RAL,LSCI)
keep using fixed FSC thresholds, even for global 
resolution estimates, but I still don't know whether Marin's 1/2-bit-based FSC 
criterion is correct (if I had to bet, I'd say not). Aiming for 1/2-bit 
information content per Fourier component may be the correct target to aim for, 
and fixed thresholds are definitely not the way to go, but I am not convinced 
that the 2005 proposal is the correct way forward.
(6) I propose a framework for deriving non-fixed FSC thresholds based on 
desired SNR and confidence levels. Under some conditions, my proposed 
thresholds behave similarly to Marin's 1/2-bit-based curve, which convinces me 
further that Marin really is onto something.

To re-iterate: the choice of target SNR (or information content) is independent 
of the choice of SNR estimator and of statistical testing framework.

Hope this helps,
Alexis






From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> On Behalf Of Alexis Rohou
Sent: 21 February 2020 16:35
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] [3dem] Which resolution?
Subject: Re: [ccp4bb] [3dem] Which resolution?

Hi all,

For those bewildered by Marin's insistence that everyone's been messing up 
their stats since the bronze age, I'd like to offer what my understanding of 
the situation. More details in this thread from a few years ago on the exact 
same topic:
https://mail.ncmir.ucsd.edu/pipermail/3dem/2015-August/003939.html
https://mail.ncmir.ucsd.edu/pipermail/3dem/2015-August/003944.html

Notwithstanding notational problems (e.g. strict equations as opposed to 
approximation symbols, or omission of symbols to denote estimation), I believe 
Frank & Al-Ali and "descendent" papers (e.g. appendix of Rosenthal & Henderson 
2003) are fine. The cross terms that Marin is agitated about indeed do in fact 
have an expectation value of 0.0 (in the ensemble; if the experiment were 
performed an infinite number of times with different realizations of noise). I 
don't believe Pawel or Jose Maria or any of the other authors really believe 
that the cross-terms are orthogonal.

When N (the number of independent Fourier voxels in a shell) is large enough, 
mean(Signal x Noise) ~ 0.0 is only an approximation, but a pretty good one, 
even for a single FSC experiment. This is why, in my book, derivations that 
depend on Frank & Al-Ali are OK, under the strict assumption that N is large. 
Numerically, this becomes apparent when Marin's half-bit criterion is plotted - 
asymptotically it has the same behavior as a constant threshold.

So, is Marin wrong to worry about this? No, I don't think so. There are indeed 
cases where the assumption of large N is broken. And under those circumstances, 
any fixed threshold (0.143, 0.5, whatever) is dangerous. This is illustrated in 
figures of van Heel & Schatz (2005). Small boxes, high-symmetry, small objects 
in large boxes, and a number of other conditions can make fixed thresholds 
dangerous.

It would indeed be better to use a non-fixed threshold. So why am I not using 
the 1/2-bit criterion in my own work? While numerically it behaves well at most 
resolution ranges, I was not convinced by Marin's derivation in 2005. 
Philosophically though, I think he's right - we should aim for FSC thresholds 
that are more robust to the kinds of edge cases mentioned above. It would be 
the right thing to do.

Hope this helps,
Alexis



On Sun, Feb 16, 2020 at 9:00 AM Penczek, Pawel A 
<pawel.a.penc...@uth.tmc.edu> wrote:
Marin,

The statistics in the 2010 review is fine. You may disagree with assumptions, 
but I 
can assure you the “statistics” (as you call it) is fine. Careful reading of 
the paper would reveal to you this much.
Regards,
Pawel

On Feb 16, 2020, at 10:38 AM, Marin van Heel 
<marin.vanh...@googlemail.

Re: [ccp4bb] [3dem] Which resolution?

2020-02-27 Thread Nave, Colin (DLSLtd,RAL,LSCI)
James
All you say seems sensible to me but there is the possibility of confusion 
regarding the use of the word threshold. I fully agree that a half bit 
information threshold is inappropriate if it is taken to mean that the data 
should be truncated at that resolution. The ever more sophisticated refinement 
programs are becoming adept at handling the noisy data.

The half bit information threshold I was discussing refers to a nominal 
resolution. This is not just for trivial reporting purposes. The half bit 
threshold is being used to compare imaging methods and perhaps demonstrate that 
significant information is present with a dose below any radiation damage 
threshold (that word again). The justification for doing this appears to come 
from the fact it has been adopted for protein structure determination by single 
particle electron microscopy. However, low contrast features might not be 
visible at this nominal resolution.

The analogy with protein crystallography might be to collect data below an 
absorption edge to give a nominal resolution of 2 angstrom. Then do it again 
well above the absorption edge. The second one gives much greater Bijvoet 
differences despite the fact that the nominal resolution is the same. I doubt 
whether anyone doing this would be misled by this as they would examine the 
statistics for the Bijvoet differences instead. However, it does indicate the 
relationship between contrast and resolution.

The question, if referring to an information threshold for nominal resolution, 
could be “Is there significant information in the data at the required contrast 
and resolution?”. Then “Can one obtain this information at a dose below any 
radiation damage limit”

Keep posting!
Regards
  Colin
From: James Holton 
Sent: 27 February 2020 01:14
To: CCP4BB@JISCMAIL.AC.UK
Cc: Nave, Colin (DLSLtd,RAL,LSCI) 
Subject: Re: [ccp4bb] [3dem] Which resolution?

In my opinion the threshold should be zero bits.  Yes, this is where CC1/2 = 0 
(or FSC = 0).  If there is correlation then there is information, and why throw 
out information if there is information to be had?  Yes, this information comes 
with noise attached, but that is why we have weights.

It is also important to remember that zero intensity is still useful 
information.  Systematic absences are an excellent example.  They have no 
intensity at all, but they speak volumes about the structure.  In a similar 
way, high-angle zero-intensity observations also tell us something.  Ever tried 
unrestrained B factor refinement at poor resolution?  It is hard to do nowadays 
because of all the safety catches in modern software, but you can get great R 
factors this way.  A telltale sign of this kind of "over fitting" is remarkably 
large Fcalc values beyond the resolution cutoff.  These don't contribute to the 
R factor, however, because Fobs is missing for these hkls. So, including 
zero-intensity data suppresses at least some types of over-fitting.

The thing I like most about the zero-information resolution cutoff is that it 
forces us to address the real problem: what do you mean by "resolution" ?  Not 
long ago, claiming your resolution was 3.0 A meant that after discarding all 
spots with individual I/sigI < 3 you still have 80% completeness in the 3.0 A 
bin.  Now we are saying we have a 3.0 A data set when we can prove 
statistically that a few non-background counts fell into the sum of all spot 
areas at 3.0 A.  These are not the same thing.

Don't get me wrong, including the weak high-resolution information makes the 
model better, and indeed I am even advocating including all the noisy zeroes.  
However, weak data at 3.0 A is never going to be as good as having strong data 
at 3.0 A.  So, how do we decide?  I personally think that the resolution 
assigned to the PDB deposition should remain the classical I/sigI > 3 at 80% 
rule.  This is really the only way to have meaningful comparison of resolution 
between very old and very new structures.  One should, of course, deposit all 
the data, but don't claim that cut-off as your "resolution".  That is just 
plain unfair to those who came before.

Oh yeah, and I also have a session on "interpreting low-resolution maps" at the 
GRC this year.  
https://www.grc.org/diffraction-methods-in-structural-biology-conference/2020/

So, please, let the discussion continue!

-James Holton
MAD Scientist
Re: [ccp4bb] [3dem] Which resolution?

2020-02-22 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Alexis
This is a very useful summary.

You say you were not convinced by Marin's derivation in 2005. Are you convinced 
now and, if not, why?

My interest in this is that the FSC with a half-bit threshold is in danger of 
being adopted elsewhere because it is becoming standard for protein 
structure determination (by EM or MX). If it is used for these mature 
techniques, it must be right!

It is the adoption of the ½ bit threshold I worry about. I gave a rather weak 
example for MX, which consisted of partial occupancy of side chains, substrates 
etc. For x-ray imaging a wide range of contrasts can occur and, if you want to 
see features with only a small contrast above the surroundings, then I think the 
half bit threshold would be inappropriate.

It would be good to see a clear message from the MX and EM communities as to 
why an information content threshold of ½ a bit is generally appropriate for 
these techniques and an acknowledgement that this threshold is 
technique/problem dependent.

We might then progress from the bronze age to the iron age.

Regards
Colin



From: CCP4 bulletin board  On Behalf Of Alexis Rohou
Sent: 21 February 2020 16:35
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] [3dem] Which resolution?

Hi all,

For those bewildered by Marin's insistence that everyone's been messing up 
their stats since the bronze age, I'd like to offer my understanding of 
the situation. More details in this thread from a few years ago on the exact 
same topic:
https://mail.ncmir.ucsd.edu/pipermail/3dem/2015-August/003939.html
https://mail.ncmir.ucsd.edu/pipermail/3dem/2015-August/003944.html

Notwithstanding notational problems (e.g. strict equations as opposed to 
approximation symbols, or omission of symbols to denote estimation), I believe 
Frank & Al-Ali and "descendant" papers (e.g. the appendix of Rosenthal & 
Henderson 2003) are fine. The cross terms that Marin is agitated about do in 
fact have an expectation value of 0.0 (in the ensemble, i.e. if the experiment 
were performed an infinite number of times with different realizations of 
noise). I don't believe Pawel or Jose Maria or any of the other authors really 
believe that the cross-terms are orthogonal.
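
(Both halves of this argument are easy to see numerically; a throwaway numpy 
sketch, with invented random vectors standing in for the Fourier shell:)

import numpy as np

rng = np.random.default_rng(1)
for n in (10, 100, 10_000):
    s = rng.normal(size=n)                    # fixed "signal" in one shell
    # normalised cross term s.n/n over many independent noise realizations
    cross = [s @ rng.normal(size=n) / n for _ in range(2000)]
    print(f"N={n:6d}  mean={np.mean(cross):+.4f}  rms={np.std(cross):.4f}")
# The ensemble mean is ~0 at every N, but the scatter seen in a *single*
# experiment shrinks only like 1/sqrt(N): the approximation is good for
# large shells and poor for small ones.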

When N (the number of independent Fourier voxels in a shell) is large enough, 
mean(Signal x Noise) ~ 0.0 is only an approximation, but a pretty good one, 
even for a single FSC experiment. This is why, in my book, derivations that 
depend on Frank & Al-Ali are OK, under the strict assumption that N is large. 
Numerically, this becomes apparent when Marin's half-bit criterion is plotted - 
asymptotically it has the same behavior as a constant threshold.
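
(This is easy to check; a sketch using the half-bit curve as printed in 
van Heel & Schatz 2005, eq. 17:)

import numpy as np

def half_bit_threshold(n):
    """Half-bit FSC threshold of van Heel & Schatz (2005), eq. 17;
    n = number of independent Fourier voxels in the shell."""
    rn = np.sqrt(n)
    return (0.2071 + 1.9102 / rn) / (1.2071 + 0.9102 / rn)

for n in (10, 100, 1_000, 100_000):
    print(f"n = {n:>6d}:  threshold = {half_bit_threshold(n):.3f}")
# Tends to 0.2071/1.2071 ~ 0.17 as n grows, i.e. asymptotically it behaves
# like a constant threshold, not far from the fixed 0.143.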

So, is Marin wrong to worry about this? No, I don't think so. There are indeed 
cases where the assumption of large N is broken. And under those circumstances, 
any fixed threshold (0.143, 0.5, whatever) is dangerous. This is illustrated in 
figures of van Heel & Schatz (2005). Small boxes, high symmetry, small objects 
in large boxes, and a number of other conditions can make fixed thresholds 
misleading.

It would indeed be better to use a non-fixed threshold. So why am I not using 
the 1/2-bit criterion in my own work? While numerically it behaves well at most 
resolution ranges, I was not convinced by Marin's derivation in 2005. 
Philosophically though, I think he's right - we should aim for FSC thresholds 
that are more robust to the kinds of edge cases mentioned above. It would be 
the right thing to do.

Hope this helps,
Alexis



On Sun, Feb 16, 2020 at 9:00 AM Penczek, Pawel A 
<pawel.a.penc...@uth.tmc.edu> wrote:
Marin,

The statistics in the 2010 review are fine. You may disagree with the 
assumptions, but I can assure you the “statistics” (as you call it) is fine. A 
careful reading of the paper would reveal this much.
Regards,
Pawel


On Feb 16, 2020, at 10:38 AM, Marin van Heel 
<marin.vanh...@googlemail.com> wrote:


Dear Pawel and All others,
This 2010 review is - unfortunately - largely based on the flawed statistics I 
mentioned before, namely on the a priori assumption that the inner product of a 
signal vector and a noise vector is ZERO (an orthogonality assumption).  We 
have refuted the (Frank & Al-Ali 1975) paper on a number of occasions (for 
example in 2005, and most recently in our BioRxiv paper), but you still take 
that as the correct relation between SNR and FRC (and you never cite the 
criticism...).
Sorry
Marin

On Thu, Feb 13, 2020 at 10:42 AM Penczek, Pawel A 
<pawel.a.penc...@uth.tmc.edu> wrote:
Dear Teige,

I am wondering whether you are familiar with

Penczek PA. Resolution measures in molecular electron microscopy. 
Methods Enzymol. 2010;482:73-100. doi: 10.1016/S0076-6879(10)82003-8.

You will find there answers to all questions you asked and much more.

Regards,
Pawel Penczek


Re: [ccp4bb] [3dem] Which resolution?

2020-02-20 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Dear Randy
Yes this makes sense.
Certainly cut-offs are bad – I hope my post wasn’t implying one should cut 
off the data at some particular resolution shell. Some reflections in a shell 
will be weak and some stronger. Knowing which are which is of course 
information.
I will have a look at the 2019 CCP4 study weekend paper
Regards
Colin

From: Randy Read 
Sent: 20 February 2020 11:45
To: Nave, Colin (DLSLtd,RAL,LSCI) 
Cc: CCP4BB@jiscmail.ac.uk
Subject: Re: [ccp4bb] [3dem] Which resolution?

Dear Colin,

Over the last few years we've been implementing measures of information gain to 
evaluate X-ray diffraction data in our program Phaser. Some results in a paper 
that has been accepted for publication in the 2019 CCP4 Study Weekend special 
issue are relevant to this discussion.

First, looking at data deposited in the PDB, we see that the information gain 
in the highest resolution shell is typically about 0.5-1 bit per reflection 
(though we haven't done a comprehensive analysis yet).  A very rough 
calculation suggests that a half-bit resolution threshold is equivalent to 
something like an I/SIGI threshold of one.  So that would fit with the idea 
that a possible resolution limit measure would be the resolution where the 
average information per reflection drops to half a bit.
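
(As a back-of-the-envelope check of that rough equivalence - a sketch using a 
Gaussian-channel analogy, which is my own simplification and emphatically not 
the LLGI calculation in Phaser:)

import math

def bits_per_measurement(snr):
    """Crude Gaussian-channel analogy: information carried by one
    measurement with signal-to-noise ratio `snr`."""
    return 0.5 * math.log2(1.0 + snr * snr)

for snr in (0.5, 1.0, 2.0, 3.0):
    print(f"I/sigI ~ {snr:3.1f}  ->  ~{bits_per_measurement(snr):.2f} bits")
# I/sigI of one comes out at exactly half a bit in this crude picture.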

Second, even if the half-bit threshold is where the data are starting to 
contribute less to the image and to likelihood targets for tasks like molecular 
replacement and refinement, weaker data still contribute some useful signal 
down to limits as low as 0.01 bit per reflection.  So any number attached to 
the nominal resolution of a data set should not necessarily be applied as a 
resolution cutoff, at least as long as the refinement target (such as our 
log-likelihood-gain on intensity or LLGI score) accounts properly for large 
measurement errors.

Best wishes,

Randy



Re: [ccp4bb] [3dem] Which resolution?

2020-02-20 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Dear all,
I have received a request to clarify what I mean by threshold in my 
contribution of 17 Feb  below and then post the clarification on CCP4BB. Being 
a loyal (but very sporadic) CCP4BBer I am now doing this. My musings in this 
thread are as much auto-didactic as didactic. In other words I am trying to 
understand it all myself.

Accepting that the FSC is a suitable metric (I believe it is), I think the most 
useful way of explaining the concept of the threshold is to refer to section 
4.2 and fig. 4 of van Heel and Schatz (2005), Journal of Structural Biology, 
151, 250-262. Figure 4C shows an FSC together with a half bit information curve 
and figure 4D shows the FSC with a 3sigma curve.

The point I was trying to make in rather an obtuse fashion is that the choice 
of threshold will depend on what one is trying to see in the image. I will try 
and give an example related to protein structures rather than uranium hydride 
or axons in the brain. In general protein structures consist of atoms with 
similar scattering power (C, N, O with the hydrogens for the moment invisible) 
and high occupancy. When we can for example distinguish side chains along the 
backbone we have a good basis for starting to interpret the map as a particular 
structure. An FSC with a half bit threshold at the appropriate resolution 
appears to be a good guide to whether one can do this. However, if a particular 
sidechain is disordered with 2 conformations, or a substrate is only 50% 
occupied, the contribution in the electron density map is reduced and might be 
difficult to distinguish from the noise. A higher threshold might be necessary 
to see these atoms but this would occur at a lower resolution than given by the 
half bit threshold. One could instead increase the exposure to improve the 
resolution but of course radiation damage lurks. For reporting structures, the 
obvious thing to do is to show the complete FSC curves together with a few 
threshold curves (e.g. half bit, one bit, 2 bits). This would enable people to 
judge whether the data is likely to meet their requirements. This of course 
departs significantly from the desire to have one number. A compromise might be 
to report FSC resolutions at several thresholds.
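
(Such a family of threshold curves is easy to generate; a sketch in the spirit 
of the van Heel & Schatz curves, where the bits-to-SNR conversion is my own 
assumption, chosen to reproduce their published half-bit and 1-bit constants:)

import numpy as np

def fsc_threshold(n, bits):
    """FSC threshold curve requiring `bits` of information per voxel,
    generalising the van Heel & Schatz (2005) half-bit curve;
    n = number of independent Fourier voxels in the shell."""
    snr = (2.0 ** bits - 1.0) / 2.0     # 1/2 bit -> 0.2071, 1 bit -> 0.5
    rn = np.sqrt(n)
    return ((snr + (2.0 * np.sqrt(snr) + 1.0) / rn)
            / (snr + 2.0 * np.sqrt(snr) / rn + 1.0))

n = np.array([50, 500, 5_000, 50_000])
for bits in (0.5, 1.0, 2.0):
    print(f"{bits:3.1f} bit:", np.round(fsc_threshold(n, bits), 3))
# Overlay these on the measured FSC and read off where it crosses each one.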

I understand that fixed value thresholds (e.g. 0.143) were originally adopted 
for EM to conform to standards prevalent for crystallography at the time. This 
would have enabled comparison between the two techniques. For many cases (as 
stated in van Heel and Schatz) there will be little difference between the 
resolution given by a half bit and that given by 0.143. However, if the former 
is mathematically correct and easy to implement then why not use it for all 
techniques? The link to Shannon is a personal reason I have for preferring a 
threshold based on information content. If I had scientific “heroes” he would 
be one of them.


I have recently had a paper on x-ray imaging of biological cells accepted for 
publication. This includes

“In order to compare theory or simulations with experiment, standard methods of 
reporting results covering parameters such as the feature examined (e.g. which 
cellular organelle), resolution, contrast, depth of material (for 2D), estimate 
of noise and dose should be encouraged. Much effort has gone into doing this 
for fields such as macromolecular crystallography but it has to be admitted 
that this is still an ongoing process.”
I think recent activity agrees with the last 6 words!



Don’t read the next bit if not interested in the relationship between the Rose 
criterion and FSC thresholds.

The recently submitted paper also includes

“A proper analysis of the relationship between the Rose criterion and FSC 
thresholds is outside the scope of this paper and would need to take account of 
factors such as the number of image voxels, whether one is in an atomicity or 
uniform voxel regime and the contrast of features to be identified in the 
image.”

This can justifiably be interpreted as saying I did not fully understand the 
relationship myself, and this was part of the reason why I raised the issue in 
another message to this thread.
Who cares anyway about the headline resolution? Well, defining a resolution can 
be important if one wants to calculate the exposure required to see particular 
features and whether they are then degraded by radiation damage. This relates 
to the issue I raised concerning the Rose criterion. As an example one might 
have a virus particle with an average density of 1.1 embedded in an object (a 
biological cell) of density 1.0 (I am keeping the numbers simple). The virus 
has a diameter of 50nm. There are 5000 voxels in the image (the number 5000 was 
used by Rose when analysing images from televisions). This gives 5000 chances 
of a false alarm, so I want to ensure the signal-to-noise ratio in the image is 
sufficiently high. This is why Rose adopted a contrast-to-noise ratio of 5 
(Rose criterion K of 5). For each voxel in the image we 
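
(The false-alarm arithmetic behind K = 5 takes only a few lines; a sketch 
assuming Gaussian noise:)

import math

n_voxels = 5000            # Rose's number of picture elements
for k in (3, 4, 5):        # candidate contrast-to-noise thresholds
    p_tail = 0.5 * math.erfc(k / math.sqrt(2.0))   # P(noise alone > k sigma)
    print(f"K = {k}:  per-voxel false alarm {p_tail:.1e},  "
          f"expected in image {n_voxels * p_tail:.1e}")
# K = 3 would give several spurious "features" per image;
# K = 5 keeps the expected number of false alarms far below one.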

Re: [ccp4bb] FW: [ccp4bb] [3dem] Which resolution?

2020-02-17 Thread Nave, Colin (DLSLtd,RAL,LSCI)
Hi John
I agree that neutrons have a role to increase the contrast for certain atoms. 
The “water window” for x-ray imaging also fulfils a similar role. The “locally 
scaled in a complex way” is a bit beyond me.

The relationship between “diffraction” errors and “imaging” errors is based 
on Parseval’s theorem applied to the errors for electron densities and 
structure factors.  See for example 
https://www-structmed.cimr.cam.ac.uk/Course/Fourier/Fourier.html and scroll 
down to Parseval’s theorem. Admittedly not a primary reference, but I think 
Randy (and Parseval, not to be confused with Wagner’s opera) are unlikely to 
have got it wrong.
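
(A two-minute numerical check of Parseval applied to map errors, using numpy's 
FFT conventions and an invented toy map:)

import numpy as np

rng = np.random.default_rng(2)
rho = rng.normal(size=(32, 32, 32))              # toy "electron density" map
rho_err = rho + rng.normal(0.0, 0.1, rho.shape)  # same map with added errors

f, f_err = np.fft.fftn(rho), np.fft.fftn(rho_err)

# Parseval: the summed squared error is identical in both domains
# (numpy's unnormalised FFT carries a factor of the voxel count).
real_space = np.sum((rho - rho_err) ** 2)
fourier_space = np.sum(np.abs(f - f_err) ** 2) / rho.size
print(real_space, fourier_space)                 # agree to rounding error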

Imaging (with both electrons and x-rays) can be lensless (as in MX, CDI and 
variants) or with an objective lens (electron microscopes have nice objective 
lenses). The physical processes are the same up to any lens but MX, CDI etc. 
use a computer to replace the lens. The computer algorithm might be imperfect, 
resulting in visible termination errors. With a decent lens, one can also see 
diffraction ripples (round bright stars in a telescope image) due to the 
restricted lens aperture.

Good debate though.

Colin
From: John R Helliwell 
Sent: 17 February 2020 16:36
To: Nave, Colin (DLSLtd,RAL,LSCI) 
Cc: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] FW: [ccp4bb] [3dem] Which resolution?

Hi Colin,
Neutrons are applied to the uranyl hydrides so as to make their scattering 
lengths much more equal than with X-rays, and so sidestep the ripple effects of 
the uranium in the X-ray case, which obscure those nearby hydrogens.
In terms of feature resolvability the email exchange (and there may be better 
ones):- http://www.phenix-online.org/pipermail/phenixbb/2017-March/023326.html
refers to “locally scaled in a complex way”. So, is the physics of the 
visibility of features really comparable between the two methods of cryoEM and 
crystal structure analysis?
Greetings,
John
Emeritus Professor John R Helliwell DSc




On 17 Feb 2020, at 13:59, Nave, Colin (DLSLtd,RAL,LSCI) 
<colin.n...@diamond.ac.uk> wrote:

Hi John
I agree that if I truncate the data at a high information content threshold 
(e.g. 2 bits), series termination errors might hide the lighter atoms (e.g. the 
hydrogens in uranium hydride crystal structures). However, I think this is 
purely a limitation of producing electron density maps via Fourier transforms 
(i.e. not the physics). A variety of techniques are available for handling 
series termination including ones which are maximally non-committal with 
respect to the missing data. The issue is still there in some fields (see 
https://onlinelibrary.wiley.com/iucr/itc/Ha/ch4o8v0001/ ). For protein 
crystallography perhaps series termination errors have become less important as 
people are discouraged from applying some I/sigI type cut off.

Cheers
Colin



From: John R Helliwell <jrhelliw...@gmail.com>
Sent: 17 February 2020 12:09
To: Nave, Colin (DLSLtd,RAL,LSCI) <colin.n...@diamond.ac.uk>
Subject: Re: [ccp4bb] FW: [ccp4bb] [3dem] Which resolution?

Hi Colin,
I think the physics of the imaging and the crystal structure analysis, 
respectively without and with Fourier termination ripples, are different. For 
the MX re Fourier series for two types of difference map see our contribution:-

http://scripts.iucr.org/cgi-bin/paper?S0907444903004219

Greetings,
John

Emeritus Professor John R Helliwell DSc
https://www.crcpress.com/The-Whats-of-a-Scientific-Life/Helliwell/p/book/9780367233020





From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> On Behalf Of Petrus Zwart
Sent: 16 February 2020 21:50
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] [3dem] Which resolution?

Hi All,

How is the 'correct' resolution estimation related to the estimated error on 
some observed hydrogen bond length of interest, or an error on the estimated 
occupancy of a ligand or conformation or anything else that has structural 
significance?

In crystallography, it isn't really (only in some very approximate fashion), 
and I doubt that in EM there is anything to that effect. If you want to use 
the resolution to get a gut feeling for how your maps look and how your data 
behave, it doesn't really matter what standard you use, as long as you are 
consistent in the metric you use. If you want to use this estimate to get at 
uncertainties of model parameters, you had better try something else.

Regards
Peter Zwart


