Re: [ccp4bb] TRUNCATE & CTRUNCATE issues.

Robbie Joosten Wed, 04 Dec 2013 01:24:55 -0800

Hi Ian,

> This probably doesn't affect the MTZ file that comes out of it 
> but it will affect the statistics of the twinning tests.

I'd be interested to see if this solves the cases where ctruncate goes into an 
infinite loop during the twinning tests. 

Cheers,
Robbie

Date: Sat, 30 Nov 2013 15:37:44 +0000
From: [email protected]
Subject: [ccp4bb] TRUNCATE & CTRUNCATE issues.
To: [email protected]

Hello All, I'd like to get your views on some changes I'm proposing to make to 
the TRUNCATE source code.  IMO there are some issues with the way TRUNCATE does 
its statistical analyses which need to be fixed.  This probably doesn't affect 
the MTZ file that comes out of it but it will affect the statistics of the 
twinning tests.  I've been meaning to do this for some time but it involves 
some fairly radical changes; now I've finally decided to bite the bullet.

One problem is that there seems to be some confusion in the source code 
comments concerning the meaning of what I call the "symmetry enhancement 
factor" (aka "epsilon" or "e" below). This is the point-group dependent factor 
by which the mean intensities of special rows and zones are enhanced by 
symmetry; for example in PG121 the mean I of the 0k0 reflections is enhanced by 
a factor of 2; in PG6 & PG622, it's a factor of 6 for the 00l reflections, and 
so on.

For space groups with screw axes (or glide planes in enantiomorphic SGs, i.e. 
the crystal contains both the asymmetric unit and its mirror image), the mean 
is multiplied by e only if the systematic absences are omitted from the sums; 
if you include them then there is no overall enhancement.  So rotation and 
screw axes (or mirrors and glide planes) need to be treated differently, yet 
the CCP4 library code for this (s/rs EPSLN & EPSLON) treats them identically 
(more on this below).  TRUNCATE calls epsilon alternately a "multiplicity" or 
"weight", which further adds to the confusion since multiplicity (m) usually 
means something completely different (it's the number of symmetry-equivalent 
reflections occurring in a full hemisphere of data, so for PG222 axial 
reflections (h00 etc.) m=1, for zero layers (hk0 etc.) m=2, and for general 
hkl m=4: in contrast e = 2, 1 and 1 resp.).
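To make the difference between e and m concrete, here is a minimal sketch for PG222. It is in Python rather than Fortran for brevity, and the names (PG222, epsilon, multiplicity) are hypothetical, not CCP4 library routines. It assumes e is the number of point-group operations that leave hkl unchanged, and m is the number of distinct equivalents (Friedel mates included) in one hemisphere:

```python
# The four operations of point group 222 are diagonal, so each one is
# stored as the signs it applies to (h, k, l).
PG222 = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]


def apply_op(op, hkl):
    """Apply a diagonal symmetry operation to a reflection index."""
    return tuple(s * x for s, x in zip(op, hkl))


def epsilon(hkl, ops=PG222):
    """Symmetry enhancement factor: operations mapping hkl onto itself."""
    return sum(1 for op in ops if apply_op(op, hkl) == tuple(hkl))


def multiplicity(hkl, ops=PG222):
    """Distinct equivalents, Friedel mates included, in one hemisphere."""
    sphere = set()
    for op in ops:
        eq = apply_op(op, hkl)
        sphere.add(eq)
        sphere.add(tuple(-x for x in eq))
    return len(sphere) // 2


for hkl in [(3, 0, 0), (3, 2, 0), (3, 2, 1)]:
    print(hkl, "e =", epsilon(hkl), "m =", multiplicity(hkl))
```

For the axial, zero-layer and general reflections this gives e = 2, 1, 1 and m = 1, 2, 4, i.e. the two quantities run in opposite directions.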

This snippet of code from TRUNCATE which accumulates sums in bins according to 
d*^2 illustrates the problem:

      CALL EPSLON(INHKL,WEIGHT,ISYSAB)
C
      FF(NT) = FF(NT) + F/WEIGHT
      SD(NT) = SD(NT) + SS/WEIGHT
      N(NT) = N(NT) + 1
      AMULT = NSYM/WEIGHT
C
...
C     Accumulate sums for Wilson plot.
C
      SN(NT) = SN(NT) + AMULT
      SW(NT) = SW(NT) + AMULT*FFSCAT
      SR(NT) = SR(NT) + AMULT*Q
      SI(NT) = SI(NT) + AMULT*F

Here WEIGHT = epsilon (the ISYSAB flag is ignored throughout), so all sums are 
being accumulated with the terms multiplied by 1/e (NSYM is the no. of 
asymmetric units so is constant and doesn't affect the results).  However IMO 
this factor should be applied only to individual intensities or their SDs (note 
that in the above code and that below, F is actually the uncorrected intensity 
just to further confuse you!).  The problem here is that for pure rotation axes 
the law of conservation of energy requires that the overall mean I is 
unchanged.  In any interference phenomenon energy can neither be created nor 
destroyed, merely transferred from one place to another.  So what happens is 
that the enhanced intensity of the axial reflections has been transferred from 
neighbouring reflections which have their intensities diminished in total by 
the same amount.  In fact an oscillating Bessel function centred on the axis is 
superimposed on the intensities so you get cylindrical zones of alternately 
enhanced and diminished intensities, with the magnitude of the oscillations 
dying away as you go further from the axis.  However the energy conservation 
law requires that the net overall average I must be unchanged by the presence 
of the axis.

This implies that the 1/e correction factor SHOULD NOT be included for pure 
rotation axes (and mirror planes) when summing for the mean I.  However the 
sums should be performed over a complete hemisphere given only one symmetry 
equivalent per reflection, which means that the multiplicity (m) factor SHOULD 
be included.  This is the direct opposite of what the code above is doing (i.e. 
it includes e but not m!).  The Fortran code would look like:

       SI(NT) = SI(NT) + M*FI     ! Total energy is conserved!
       SN(NT) = SN(NT) + M        ! Count reflections in hemisphere.

Note that the statistically valid procedure will be different if one is, say, 
summing Is for a likelihood function, since this requires that the terms are 
statistically independent so one would then add only one term per equivalent, 
not a complete hemisphere, i.e. in that case the multiplicity factors should be 
omitted.

For screw axes (and glide planes) the situation is different: there is no 
Bessel function and only the axial reflections are affected, so this case 
requires different handling in the code (as I said above, this is not 
happening!).  Now, since systematic absences are normally not present in the 
data, the mean intensity of the remaining reflections is enhanced by the e 
factor, purely by the action of omitting the systematic absences of zero I.  
This implies that we need to simulate the presence of the systematic absences 
when taking the mean.  So for example in the PG6 case we would have to sum the 
intensities of the 00l, l=6n reflections WITHOUT correction, but then count 
each 00l reflection as though it were 6 reflections (i.e. also counting the 
omitted sys. abs. in the average), so the code would now look like:

       SI(NT) = SI(NT) + M*FI     ! Total energy is still conserved!
       SN(NT) = SN(NT) + M*E      ! but also count the sys. abs. that were omitted.

This implies that at least for the reflection counts, the e correction factor 
SHOULD be included for screw axes (and glide planes), and for the same reason 
as above the m factor SHOULD also be included.
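The two accumulation rules above can be put side by side in a short sketch. It is in Python rather than Fortran so it runs stand-alone. The function name, the tuple layout and the is_screw flag are assumptions made for illustration; TRUNCATE has no such interface:

```python
def mean_intensity(reflections):
    """Mean I over a hemisphere for one resolution bin.

    Each reflection is (intensity, m, e, is_screw), where m is the
    multiplicity, e the symmetry enhancement factor, and is_screw marks
    reflections on a screw axis or glide plane whose systematic absences
    have been omitted from the data.
    """
    si = 0.0  # sum of intensities (numerator)
    sn = 0.0  # number of reflections counted (denominator)
    for intensity, m, e, is_screw in reflections:
        # The intensity is never divided by e: total energy is conserved.
        si += m * intensity
        # For screw axes each observed axial reflection also stands in
        # for the omitted absences, so it counts as e reflections.
        sn += (m * e) if is_screw else m
    return si / sn


# PG6-like bin: one 00l (l=6n) screw-axis reflection enhanced sixfold,
# plus two general reflections with m = 6.
bin_data = [(6.0, 1, 6, True), (1.0, 6, 1, False), (1.0, 6, 1, False)]
print(mean_intensity(bin_data))
```

In this toy bin the enhanced axial reflection contributes 6 to the sum and 6 to the count, so the mean stays at the value of the general reflections.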

The differences between rotation and screw axes (or between mirrors and glides) 
arise because Wilson's assumption of uniform random distribution of atoms 
breaks down in the former case: an atom cannot approach a rotation axis or 
mirror plane closer than its VDW radius, so this excluded zone along the axis 
or plane causes an interference effect.  In both cases the main effect is 
actually that in projection along the axis the atomic positions are not random: 
they are correlated by an apparent inversion centre (it's actually another 
interference effect this time from pairs of atoms lying in the same plane of 
reflection and related by symmetry).

All the above is relevant ONLY to taking the average intensity.  For other 
purposes the correct procedure is likely to differ.  For example if one is 
interested in the individual normalised structure amplitudes for direct methods 
(i.e. not just the mean), then the sqrt(1/e) factor clearly SHOULD be applied 
to the individual amplitudes.  Also if one is calculating higher moments of Z 
(= normalised I) then the deviations will not cancel (there's nothing in the 
energy conservation law that says that energy^n is conserved if n is not 0 or 
1).  In the rotation axis case each large on-axis positive deviation tends to 
be offset by several small off-axis deviations, so in that case the optimal 
procedure would appear to be to divide the on-axis Is by e before raising them 
to the power n, and to weight each term by m, when summing for the higher 
moments (say n >= 2), e.g.:

      SM(N) = SM(N) + M*(FI/E)**N
      NM(N) = NM(N) + M

For the screw-axis case with the sys. abs. omitted the situation is again 
different and requires different code; IMO we should distribute an amount I/e 
from each axial reflection equally among the omitted sys. abs. and then include 
them as though they were all present:

      SM(N) = SM(N) + M*E*(FI/E)**N
      NM(N) = NM(N) + M*E

One further issue needs to be addressed (TRUNCATE is not short of issues - the 
ones I mention here are only a fraction!).  The code for calculating the 
moments in TRUNCATE is:

C---- Sums for moments of I
         F = FFA(1)/WEIGHT
         IF(F.GT.0.0) SMMEMM(1,NT,ICEN) = SMMEMM(1,NT,ICEN) + sqrt(F)
         IF(F.GT.0.0) SMMEMM(3,NT,ICEN) = SMMEMM(3,NT,ICEN) + F*sqrt(F)
         SMMIMM(1,NT,ICEN) = SMMIMM(1,NT,ICEN) + F
         SMMIMM(2,NT,ICEN) = SMMIMM(2,NT,ICEN) + F**2
         SMMIMM(3,NT,ICEN) = SMMIMM(3,NT,ICEN) + F**3
         SMMIMM(4,NT,ICEN) = SMMIMM(4,NT,ICEN) + F**4
         NMMNUM(NT,ICEN) = NMMNUM(NT,ICEN) + 1

Notwithstanding that the WEIGHT factor (aka epsilon) probably should not be 
applied to the lower moments, and the multiplicity factor probably should be 
applied everywhere, there is still a problem here.  F is the uncorrected 
intensity but IMO we should be using the corrected intensity (i.e. by French & 
Wilson's (F & W's) Bayesian procedure): why else are we calculating the 
correction if not to apply 
it?  The uncorrected intensities may be negative which adds spurious noise to 
the moments (a negative intensity makes the same contribution to an even moment 
as an equal positive intensity: this cannot possibly be correct).  However 
using the corrected Is brings another problem: when calculating the moments we 
should be using the expected values based on the posterior probability 
distribution of the Is.  This means integrating I^n over all I > 0, weighted 
by the posterior probability density.  This is non-trivial; however, there is 
a way using the parabolic cylinder functions U(a,x) 
(http://en.wikipedia.org/wiki/Parabolic_cylinder_functions).  The nice thing 
about this method is one can generalise it to calculate the Bayesian-corrected 
F and I as well as arbitrary moments and get rid of F & W's cubic spline 
interpolation code with the slightly worrying warning (see NEGIAS & NEGICS s/rs 
in TRUNCATE):

C---- Accuracy - better than 5 percent  (i think).

With the PCFs the code becomes both much simpler and much more accurate (to 
REAL*4 precision, so ~ 0.0001% accuracy!).
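To illustrate how the PCF route might work, here is a minimal sketch for the acentric case only. It is not the proposed TRUNCATE code, and the function names are hypothetical. It assumes the posterior weight is exp(-J/S) * exp(-(J - I)^2 / (2 sigma^2)) for J > 0, where S is the mean intensity of the bin, and uses the standard integral of x^(v-1) exp(-b x^2 - c x) in terms of D_{-v}(z) = U(v - 1/2, z). That gives E[J^n] = sigma^n n! D_{-(n+1)}(z) / D_{-1}(z) with z = sigma/S - I/sigma. The D functions are built from erfc by the three-term recurrence, which is fine for a sketch but will lose accuracy or overflow for extreme z:

```python
import math


def pcf_d_negative(order, z):
    """Parabolic cylinder function D_{-order}(z) for integer order >= 0."""
    d_prev = math.exp(-0.25 * z * z)  # D_0(z)
    if order == 0:
        return d_prev
    # D_{-1}(z) = sqrt(pi/2) * exp(z^2/4) * erfc(z/sqrt(2))
    d_curr = (math.sqrt(0.5 * math.pi) * math.exp(0.25 * z * z)
              * math.erfc(z / math.sqrt(2.0)))
    # Recurrence: D_{-(k+1)} = (D_{-(k-1)} - z * D_{-k}) / k
    for k in range(1, order):
        d_prev, d_curr = d_curr, (d_prev - z * d_curr) / k
    return d_curr


def posterior_moment(n, i_obs, sigma, mean_i):
    """E[J^n] under the assumed acentric posterior (see lead-in)."""
    z = sigma / mean_i - i_obs / sigma
    ratio = pcf_d_negative(n + 1, z) / pcf_d_negative(1, z)
    return sigma ** n * math.factorial(n) * ratio


# A weak negative observation still yields a positive expected intensity.
print(posterior_moment(1, -0.5, 1.0, 2.0))
```

With n = 1 this gives the corrected I directly, and higher n gives the moments from the same two function calls, which is the generalisation referred to above. A production version would want a scaled or asymptotic evaluation of the D functions for large |z|.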

I should also say that I believe that CTRUNCATE is not immune from these 
issues: it seems to be a straight translation at least in part from Fortran to 
C++ (though I'm far from being a C++ expert so I'll leave fixing CTRUNCATE to 
those who know what they're doing!).

Sorry about the length of this: at least you can't say I didn't consult you!

Cheers

-- Ian
