Re: [ccp4bb] To Trim or Not to To Trim

2023-03-19 Thread Lijun Liu
Hi James,

First of all, I think all these trim-or-not-to-trim practices are compromises for cases where the data simply do not offer local density strong enough to model the side chain reliably. That so many smart pioneers could not reach a simple agreement means this scientific art has a personality-dependent element too. Depending on how the model builder interprets and how the reader understands it, I am OK with (a) or (c) from your list, and I personally dislike the zero-occupancy treatment (sorry if this annoys many people). The reason is that if a "side-chain conformation" is modeled with zero occupancy, the information reads as: every other conformation is possible (their occupancies sum to 1.0) except the one actually modeled, which is the only conformation explicitly described and yet, at zero occupancy, the one declared impossible. Logically, a paradox.

Your test modeling and refinement of a lysine is interesting, but the experiment may not be a flawless way to compare the different strategies. It may be based on a dataset that genuinely makes side-chain modeling difficult because of very weak density around the expected location. The huge difference in CCtrue in your experiment probably just reflects the model's completeness (4 of 10 non-H atoms missing in the trim treatment), and from your description the dataset may actually allow building the side chain, fully or partially, in one or more conformations. Also, isn't this system too small? In a normal macromolecular refinement (say, more than 100 residues), and not to say this is unimportant, treating a few side chains in different ways may not shift the refinement statistics so significantly (unlike a single bond's huge deviation, which can increase the overall RMSD many times over).

Lijun
Re: [ccp4bb] To Trim or Not to To Trim

2023-03-19 Thread James Holton
They say one test is worth a thousand expert opinions, so I tried my 
hand at the former.


The question is: what is the right way to treat disordered side chains?
a) omit atoms you cannot see
b) build them, and set occupancy to zero
c) build them, and "let the B factors take care of it"
d) none of the above

The answer, of course, is d).

Oh, c'mon.  Yes, I know one of a,b, or c is what you've been doing your 
whole life. I do it too.  But, let's face it: none of these solutions 
are perfect.  So, the real question is not which one is "right", but 
which is the least wrong?


We all know what is really going on: the side chain is flapping around. 
No doubt it spends most of its time in energetically reasonable but 
nevertheless numerous conformations.  There are 41 "Favorable" rotamers 
for Lys alone, and it doesn't take that many to spread the density thin 
enough to fall below the classical 1-sigma contour level. The atoms are 
still there, they are still contributing to the data, and they haven't 
gone far. So why don't we "just" model that?  Already, I can hear the 
cries of "over-fitting!" and "observations/parameters!", "model bias!", 
and "think of the children!"  Believe it or not, none of these are the 
major issue here. Allow me to demonstrate:


Consider a simple case where we have a Lys side chain in ten conformers. 
I chose from popular rotamers, but evenly spread. That is, all 10 
conformers have an occupancy of 0.10, and there is a 3-3-4 split of chi1 
values between minus, plus and trans.  This will give the maximum 
contrast of density between CB and CG.  Let us further require that 
there is no strain in this ground-truth. No stretched bonds, no tortured 
angles, no clashes, etc.  Real molecules don't occupy such high-energy 
states unless they absolutely have to.  Let us further assume that the 
bulk solvent works the way phenix models it, which is a probe radius of 
1.1 A for both ions and aliphatics and a shrink radius of 0.9.  But, 
instead of running one phenix.fmodel job, I ran ten: one for each 
conformer (A thru J).  To add some excitement, I moved the main chain 
~0.2 A in a random direction for each conformer. I then took these ten 
calculated electron density maps (bulk solvent and all) and added them 
together to form the ground truth for the following trials. Before 
refinement, I added noise consistent with an I/sigma of 50 and cut the 
resolution at 2.0 A. Wilson B is 50:
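The map-building recipe above can be sketched roughly as follows. This is a toy sketch, not the actual phenix.fmodel pipeline: the per-conformer maps are random placeholder arrays, the resolution cutoff is omitted, and the I/sigma-to-amplitude noise conversion is my own back-of-envelope assumption.

```python
import numpy as np

rng = np.random.default_rng(7)

# Placeholder stand-ins for the ten phenix.fmodel maps (conformers A..J),
# each calculated with bulk solvent on a common grid.  In the real
# experiment these would be read from the fmodel output files.
conformer_maps = [rng.normal(loc=1.0, size=(16, 16, 16)) for _ in range(10)]

# Each conformer has occupancy 0.10, so the ground truth is the
# equally weighted sum, i.e. the mean of the full-occupancy maps.
ground_truth = np.mean(conformer_maps, axis=0)

# Add noise "consistent with I/sigma = 50".  Since I ~ F^2, a 1/50
# fractional error on I corresponds to roughly 1/100 on F, so perturb
# each Fourier amplitude by ~1% with a random phase (an assumption).
F = np.fft.rfftn(ground_truth)
noise = rng.normal(scale=np.abs(F) / 100.0) * np.exp(
    1j * rng.uniform(0.0, 2.0 * np.pi, size=F.shape)
)
noisy_map = np.fft.irfftn(F + noise, s=ground_truth.shape)
```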


CCtrue   Rwork%   Rfree%   fo-fc(sigma)   description
0.8943    9.05    10.60    5.9            stump at CB
0.9540    9.29    11.73    6.0            single conformer, zero occupancy
0.9471   10.35    15.04    5.1            single conformer, full occupancy, refmac5
0.9523    9.78    15.61    4.9            single conformer, full occupancy, phenix.refine


So, it would appear that the zero-occupancy choice "wins", but by the 
narrowest of margins.  Here CCtrue is the Pearson correlation 
coefficient between the ground-truth right-answer electron density and 
the 2fofc map resulting from the refinement.  Rwork and Rfree are the 
usual suspects, and fo-fc indicates the tallest peak in the difference 
map. Refinement was with refmac unless otherwise indicated. I think we 
often forget that both phenix and refmac restrain B factor values, not 
just through bonds but through space, and they use rather different 
algorithms. Refmac tries to make the histogram of B factors "look 
right", whereas phenix allows steeper gradients. I also ran all 10 
correct rotamers separately and picked the one with the best CCtrue to 
show above. If you instead sort on Rfree (which you really shouldn't 
do), you get different bests, but they are not much better (as low as 
10.5%).  So, the winner here depends on how you score.  CCtrue is the 
best score, but also unfortunately unavailable for real data.
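For two maps sampled on a matching grid, the CCtrue score described above is just a Pearson correlation over the voxels; a minimal numpy version (my own helper, not taken from any refinement package) might look like:

```python
import numpy as np

def cc_true(map_a: np.ndarray, map_b: np.ndarray) -> float:
    """Pearson correlation between two density maps sampled on the
    same grid, e.g. a ground-truth map vs. a refined 2fofc map."""
    a = map_a.ravel() - map_a.mean()
    b = map_b.ravel() - map_b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))
```

Because the score is invariant to overall scale and offset, a map compared with the same map on a different scale still gives CCtrue = 1.0.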


  It is perhaps interesting here that better CCtrue goes along with 
worse Rfree. This is not what I usually see in experiments like this. 
Rather, what I think is going on here is the system is frustrated. We 
are trying to fit various square pegs into a round hole, and none of 
them fit all that well.


In all cases here the largest difference peak indicated another 
place to put the Lys, so why not build into that screaming, 6-sigma 
difference peak?  Here is what happens when you do that:
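Locating that tallest peak is straightforward once the difference map is in hand. A minimal sketch (my own helper, reporting height in map-sigma units, the convention used in the tables here):

```python
import numpy as np

def tallest_peak(diff_map: np.ndarray):
    """Return (height_in_sigma, grid_index) of the strongest positive
    peak in a difference (fo-fc) map, with sigma taken as the map's
    standard deviation (the usual contouring convention)."""
    idx = np.unravel_index(int(np.argmax(diff_map)), diff_map.shape)
    return float(diff_map[idx] / diff_map.std()), idx
```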


CCtrue   Rwork%   Rfree%   fo-fc(sigma)   description
0.8943    9.05    10.60    5.9            stump at CB
0.9580    9.95    11.60    6.4            stump at CG
0.9585   10.20    12.29    6.2            stump at CG, all 10 confs
0.9543   10.61    12.24    5.3            stump at CD, all 10 confs
0.9383   10.69    14.64    4.1            stump at CE, all 10 confs
0.9476    9.66    13.48    4.6            all atoms, all 10 confs
0.9214    7.09    11.8     5.6            three conformers (worst of 120 combos)
0.9718    6.53    8.55     4.3            three conformers (best of 120 combos)
0.9710    7.17    9.44     6.1            two conformers (best of 45 combos)
0.9471   10.35    15.04    5.1            single conformer (best of 10 choices)


Re: [ccp4bb] To Trim or Not to To Trim

2023-03-19 Thread Pietro Roversi

Dear all,

well at least ChatGPT is still way off the heart of the matter and 
several steps behind some of the best views in the community.

I asked it:

Please write a paragraph on the subject: "10 things I hate about 
structural biology data deposition policy"


It gave me this:

"As an AI language model, I do not have personal opinions or feelings. 
However, I can provide a general response to the topic you have given 
me. Structural biology data deposition policies are critical to ensure 
that research data is accessible and reproducible, facilitating 
scientific progress. However, some researchers may find these policies 
frustrating, especially when they encounter obstacles during the 
process. Ten things that researchers may dislike about structural 
biology data deposition policies could include complicated data formats, 
unclear data preparation instructions, data release delays, data sharing 
restrictions, lack of feedback from data repositories, non-standardized 
metadata requirements, confidentiality concerns, data loss risks, high 
submission fees, and legal obligations. Despite these challenges, data 
deposition policies remain necessary to promote transparency, 
collaboration, and scientific integrity in the field of structural 
biology."


Ciao

Pietro




On 18-03-2023 10:28, Dale Tronrud wrote:

I'm going to dive back in here again to expand this discussion.
Whether this diversion clarifies or obscures issues surrounding the
"crystallographers' dilemma" I'll leave for others to decide.

   There is currently considerable discussion, among people who care
about cell phone cameras, over the behavior of the cameras in some
Samsung cell phones when photos are taken that include the Moon.

https://www.reddit.com/r/Android/comments/11nzrb0/samsung_space_zoom_moon_shots_are_fake_and_here/

   In this post, evidence is presented that when a photo taken with one
of these phone cameras includes the Earth's largest satellite, the
image shows a properly exposed and detailed picture of the orb.  This
despite the fact that the test photograph described in the post is of
a tableau containing a deliberately blurred photo of the cratered
globe. The claim is made that the Samsung app is adding information
from sources other than the camera's light sensor and that the
image is therefore "fake".

   I expect that Samsung would reply that, once the app is confident
that the silver disk in the image is the traditional nighttime symbol
of romance, it is perfectly reasonable to make that now-identified
aerial phenomenon appear in the image as expected by every sighted
human in the history of our species.  There have been billions of high
quality photos of the silicate sphere taken. (This is both literally
true and a gross underestimate.)  How can the photo be fake if it
better reflects what the photographer saw than what can be deduced
from only the raw pixels of the sensor?

   Of course, this example differs only in degree from common practice
going back to the beginning of photography.  Photos have always been
modified, sometimes in order to deceive the viewer, but most often to
make the photos more like what the photographer believed the scene
actually looked like.  For example, in nearly every photo I take I
"correct" the color balance.

   Is the photo with a detailed Moon fake?  Are my photos taken at the
forest floor, but without everything being some shade of green, fake?
I think most people would be satisfied if there was a way for them to
know what sources of information were used in creating the image.

   We, as scientists, are much more demanding of our PDB models.  We
build better models when we use all the knowledge at our disposal.  If
we are interpreting a 9A resolution map of hemoglobin and see a
disk-shaped piece of density where we know the heme goes, we are
perfectly justified to build an atomic model of heme.  We are also
obliged to make clear that the exact atomic positions, bond lengths
and angles, were not derived from that map, just as a journalist needs
to make clear to the reader that their photo has been processed to
include detail which was not present when the image was "taken".

   I have deposited models that contained features which were only
"consistent" with the electron density but supported by enough other
forms of evidence to make me confident in their existence.  I have
done my best to make the justification of these models clear in the
reports I have written but continue to be frustrated by the lack of
tools to represent the precise interplay of data sources that support
my model WITHIN the deposition.  I am not so naive as to believe that
everyone who has cited my papers has actually read them.

Dale E. Tronrud


On 3/10/2023 1:05 AM, Julia Griese wrote:

Hi all,

My impression has been that the most common approach these days is to 
“let the B-factors take care of it”, but I might be wrong. Maybe it’s 
time to run another poll?


Personally, I call any other approach