[ccp4bb] TRUNCATE & CTRUNCATE issues.

Ian Tickle Sat, 30 Nov 2013 07:39:40 -0800

Hello All, I'd like to get your views on some changes I'm proposing to make
to the TRUNCATE source code.  IMO there are some issues with the way
TRUNCATE does its statistical analyses which need to be fixed.  This
probably doesn't affect the MTZ file that comes out of it but it will
affect the statistics of the twinning tests.  I've been meaning to do this
for some time but it involves some fairly radical changes; now I've finally
decided to bite the bullet.


One problem is that there seems to be some confusion in the source code
comments concerning the meaning of what I call the "symmetry enhancement
factor" (aka "epsilon" or "e" below). This is the point-group dependent
factor by which the mean intensities of special rows and zones are enhanced
by symmetry; for example in PG121 the mean I of the 0k0 reflections is
enhanced by a factor of 2; in PG6 & PG622, it's a factor of 6 for the 00l
reflections, and so on.

For space groups with screw axes (or glide planes in enantiomorphic SGs,
i.e. the crystal contains both the asymmetric unit and its mirror image),
the mean is multiplied by e only if the systematic absences are omitted
from the sums; if you include them then there is no overall enhancement.
So rotation and screw axes (or mirrors and glide planes) need to be treated
differently, yet the CCP4 library code for this (s/rs EPSLN & EPSLON) treat
them identically (more on this below).  TRUNCATE calls epsilon
alternatively a "multiplicity" or "weight" which further adds to the
confusion since multiplicity (m) usually means something completely
different (it's the number of times a symmetry-equivalent reflection occurs
in a full hemisphere of data, so for PG222 axial reflections (h00 etc.)
m=1, for zero layers (hk0 etc.) m=2, and for general hkl m=4: in contrast e
= 2, 1 and 1 resp.).

This snippet of code from TRUNCATE which accumulates sums in bins according
to d*^2 illustrates the problem:

      CALL EPSLON(INHKL,WEIGHT,ISYSAB)
C
      FF(NT) = FF(NT) + F/WEIGHT
      SD(NT) = SD(NT) + SS/WEIGHT
      N(NT) = N(NT) + 1
      AMULT = NSYM/WEIGHT
C
...
C     Accumulate sums for Wilson plot.
C
      SN(NT) = SN(NT) + AMULT
      SW(NT) = SW(NT) + AMULT*FFSCAT
      SR(NT) = SR(NT) + AMULT*Q
      SI(NT) = SI(NT) + AMULT*F

Here WEIGHT = epsilon (the ISYSAB flag is ignored throughout), so all sums
are being accumulated with the terms multiplied by 1/e (NSYM is the no. of
asymmetric units so is constant and doesn't affect the results).  However
IMO this factor should be applied only to individual intensities or their
SDs (note that in the above code and that below, F is actually the
uncorrected intensity just to further confuse you!).  The problem here is
that for pure rotation axes the law of conservation of energy requires that
the overall mean I is unchanged.  In any interference phenomenon energy can
neither be created nor destroyed, merely transferred from one place to
another.  So what happens is that the enhanced intensity of the axial
reflections has been transferred from neighbouring reflections which have
their intensities diminished in total by the same amount.  In fact an
oscillating Bessel function centred on the axis is superimposed on the
intensities so you get cylindical zones of alternating enhanced and
diminishing intensities with the magnitude of the oscillations dying away
as you go further from the axis.  However the energy conservation law
requires that the net overall average I must be unchanged by the presence
of the axis.

This implies that the 1/e correction factor SHOULD NOT be included for pure
rotation axes (and mirror planes) when summing for the mean I.  However the
sums should be performed over a complete hemisphere given only one symmetry
equivalent per reflection, which means that the multiplicity (m) factor
SHOULD be included.  This is the direct opposite of what the code above is
doing (i.e. it includes e but not m!).  The Fortran code would look like:

       SI(NT) = SI(NT) + M*FI     # Total energy is conserved!
       SN(NT) = SN(NT) + M       # Count reflections in hemisphere.

Note that the statistically valid procedure will be different if one is say
summing Is for a likelihood function, since this requires that the terms
are statistically independent so one would then add only one term per
equivalent, not a complete hemisphere, i.e. in that case the multiplicity
factors should be omitted.

For screw axes (and glide planes) the situation is different: there is no
Bessel function and only the axial reflections are affected, so therefore
it requires different handling in the code (as I said above, this is not
happening!).  Now, since systematic absences are normally not present in
the data, the mean intensity of the remaining reflections is enhanced by
the e factor, purely by the action of omitting the systematic absences of
zero I.  This implies that we need to simulate the presence of the
systematic absences when taking the mean.  So for example in the PG6 case
we would have to sum the intensities of the 00l, l=6n reflections WITHOUT
correction, but then count each 00l reflection as though it were 6
reflections (i.e. also counting the omitted sys. abs. in the average), so
the code would now look like:

       SI(NT) = SI(NT) + M*FI     # Total energy is still conserved!
       SN(NT) = SN(NT) +M*E    # but also count the sys. abs. that were
omitted.

This implies that at least for the reflection counts, the e correction
factor SHOULD be included for screw axes (and glide planes), and for the
same reason as above the m factor SHOULD also be included.

The differences between rotation and screw axes (or between mirrors and
glides) arise because Wilson's assumption of uniform random distribution of
atoms breaks down in the former case: an atom cannot approach a rotation
axis or mirror plane closer than its VDW radius, so this excluded zone
along the axis or plane causes an interference effect.  In both cases the
main effect is actually that in projection along the axis the atomic
positions are not random: they are correlated by an apparent inversion
centre (it's actually another interference effect this time from pairs of
atoms lying in the same plane of reflection and related by symmetry).

All the above is relevant ONLY to taking the average intensity.  For other
purposes the correct procedure is likely to differ.  For example if one is
interested in the individual normalised structure amplitudes for direct
methods (i.e. not just the mean), then the sqrt(1/e) factor clearly SHOULD
be applied to the individual amplitudes.  Also if one is calculating higher
moments of Z (= normalised I) then the deviations will not cancel (there's
nothing in the energy conservation law that says that energy^n is conserved
if n is not 0 or 1).  In the rotation axis case each large on-axis positive
deviation tends to be offset by several small off-axis deviations, so in
that case the optimal procedure would appear to be to multiply the on-axis
Is by m/e before summing for the higher moments (say n >= 2), e.g.:

      SM(N) = SM(N) + M*(FI/E)**N
      NM(N) = NM(N) + M

For the screw-axis case with the sys. abs. omitted the situation is again
different and requires different code; IMO we should distribute an amount
I/e from each axial reflection equally among the omitted sys. abs. and then
include them as though they were all present:

      SM(N) = SM(N) + M*E*(FI/E)**N
      NM(N) = NM(N) + M*E

One further issue needs to be addressed (TRUNCATE is not short of issues -
the ones I mention here are only a fraction!).  The code for calculating
the moments in TRUNCATE is:

C---- Sums for moments of I
         F = FFA(1)/WEIGHT
         IF(F.GT.0.0) SMMEMM(1,NT,ICEN) = SMMEMM(1,NT,ICEN) + sqrt(F)
         IF(F.GT.0.0) SMMEMM(3,NT,ICEN) = SMMEMM(3,NT,ICEN) + F*sqrt(F)
         SMMIMM(1,NT,ICEN) = SMMIMM(1,NT,ICEN) + F
         SMMIMM(2,NT,ICEN) = SMMIMM(2,NT,ICEN) + F**2
         SMMIMM(3,NT,ICEN) = SMMIMM(3,NT,ICEN) + F**3
         SMMIMM(4,NT,ICEN) = SMMIMM(4,NT,ICEN) + F**4
         NMMNUM(NT,ICEN) = NMMNUM(NT,ICEN) + 1

Notwithstanding that the WEIGHT factor (aka epsilon) probably should not be
applied to the lower moments, and the multiplicity factor probably should
be applied everywhere, there is still a problem here.  F is the uncorrected
intensity but IMO we should be using the corrected intensity (i.e. by F &
W's Bayesian procedure): why else are we calculating the correction if not
to apply it?  The uncorrected intensities may be negative which adds
spurious noise to the moments (a negative intensity makes the same
contribution to an even moment as an equal positive intensity: this cannot
possibly be correct).  However using the corrected Is brings another
problem: when calculating the moments we should be using the expected
values based on the posterior probability distribution of the Is.  This
means integrating the moments over all expected Is > 0, multiplied by the
probability density.  This is non-trivial, however there is a way using the
parabolic cylinder functions U(a,x) (
http://en.wikipedia.org/wiki/Parabolic_cylinder_functions).  The nice thing
about this method is one can generalise it to calculate the
Bayesian-corrected F and I as well as arbitrary moments and get rid of F &
W's cubic spline interpolation code with the slightly worrying warning (see
NEGIAS & NEGICS s/rs in TRUNCATE):

C---- Accuracy - better than 5 percent  (i think).

With the PCFs the code becomes both much simpler and much more accurate (to
REAL*4 precision, so ~ 0.0001% accuracy!).

I should also say that I believe that CTRUNCATE is not immune from these
issues: it seems to be a straight translation at least in part from Fortran
to C++ (though I'm far from being a C++ expert so I'll leave fixing
CTRUNCATE to those who know what they're doing!).

Sorry about the length of this: at least you can't say I didn't consult you!

Cheers

-- Ian

[ccp4bb] TRUNCATE & CTRUNCATE issues.

Reply via email to