Re: [ccp4bb] [phenixbb] C-beta RMSD

2015-06-26 Thread Douglas Theobald
THESEUS can do it, and it comes bundled with ccp4 so definitely on-topic.

If you want RMSD of “equivalent” amino acids, you must tell THESEUS which 
residues are equivalent with a sequence alignment.  Then use the -I option to 
get the RMSD (and other stats) of the pdb files in their current orientation.  
E.g.,

theseus -A cytc.aln -I d1cih__.pdb d1crj__.pdb

Cheers,

Douglas



 On Jun 26, 2015, at 3:52 AM, Kaushik Hatti hskaus...@gmail.com wrote:
 
 Hi,
 
 Is there a tool which can calculate C-beta RMSD for equivalent amino acids of 
 homologous structures, post C-alpha superposition? 
 
 Sorry if it's off topic,
 Thanks,
 Kaushik
 
 -- 
 Stupidity is everyone’s birthright.  However, only the learned exercise it!
 --Kaushik (28Oct2014)
 ___
 phenixbb mailing list
 pheni...@phenix-online.org
 http://phenix-online.org/mailman/listinfo/phenixbb
 Unsubscribe: phenixbb-le...@phenix-online.org


Re: [ccp4bb] [phenixbb] Allignment of multiple structures

2015-06-01 Thread Douglas Theobald
THESEUS should be able to do it rather easily.  You can email me offlist if you 
need some guidance.


 On Jun 1, 2015, at 3:53 PM, jens j birktoft birkt...@nyu.edu wrote:
 
 I apologize if this question has been asked before but I still 
 need help finding an answer to the following. 
 
 I am looking for a program/web-server that will calculate the superposition 
 of multiple structures (non-protein!)
 
 Thanks
 
 -- 
 +++
 Jens J. Birktoft
 Structural DNA Nanotechnology
 Department of Chemistry
 New York University
 e-mail: jens.kn...@gmail.com; Phone: 212-749-5057
 very slow-mail: 350 Central Park West, Suite 9F, New York, NY 10025
 +++
 ___
 phenixbb mailing list
 pheni...@phenix-online.org
 http://phenix-online.org/mailman/listinfo/phenixbb
 Unsubscribe: phenixbb-le...@phenix-online.org


Re: [ccp4bb] [RANT] Reject Papers describing non-open source software

2015-05-12 Thread Douglas Theobald
On May 12, 2015, at 3:19 PM, Robbie Joosten robbie_joos...@hotmail.com wrote:
 
 I strongly disagree with rejecting a paper for any other reasons than
 scientific ones.

I agree, but … one of the foundations of science is independent replicability 
and verifiability.  In practice, for me to be able to replicate and verify your 
computational analysis and results, I will need to be able to see your source 
code, compile it myself, and potentially modify it.  These requirements in 
effect necessitate some sort of open source model, in the broadest sense of the 
term.  To take one of your examples, the Ms-RSL license — I can’t effectively 
replicate and verify your results if I’m legally prohibited from compiling and 
modifying your source code, so the Ms-RSL is out.  

 A paper describing software should properly describe the
 algorithms to ensure the reproducibility.

*Should*.  In practice, we all know (those programmers among us do, anyway) 
that descriptions of source code do not suffice.  

 The source should be available for
 inspection to ensure the program does what was claimed, for all I care this
 can be under the Ms-RSL license or just under good-old copyright. The
 program should preferably be available free for academic users, but if the
 paper is good you should be able to re-implement the tool if it is too
 expensive or doesn't exactly do what you want so it isn't entirely
 necessary. 

 Making the software open source (in an OSS sense) does not solve any
 problems that a good description of the algorithms doesn't do well already.

This is just wildly wrong.  It’s basically impossible to ensure and verify that 
a “good” description of the algorithm actually corresponds to the source code 
without seeing, using, and modifying the source.  To take an experimental 
analogy — my lab has endured several cases where we read a “good” published 
description of the subcloning and sequencing of some vector, only to find that 
the detailed published description is wrong when we are given the chance to 
analyze the vector ourselves.  It happens all the time, and computer code is no 
different in this respect.  

 OSS does not guarantee long-term availability, a paper will likely outlive the
 software repository. OSS licenses (not the BSD license) can be so
 restrictive that you end up having to re-implement the algorithms anyway. So
 not having an OSS license should not be a reason to reject the paper about
 the software.
 
 Cheers,
 Robbie 
 
 -Original Message-
 From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of
 James Stroud
 Sent: Tuesday, May 12, 2015 20:40
 To: CCP4BB@JISCMAIL.AC.UK
 Subject: Re: [ccp4bb] [RANT] Reject Papers describing non-open source
 software
 
 On May 12, 2015, at 12:29 PM, Roger Rowlett rrowl...@colgate.edu
 wrote:
 
 Was the research publicly funded? If you receive funds from NSF, for
 example,
 you are expected to share and make widely available and usable software
 and inventions created under a grant (section VI.D.4. of the Award and
 administration guide). I don't know how enforceable that clause is,
 however.
 
 The funding shouldn't matter. I suggest that a publication that has the
 purpose
 of describing non-open source software should be summarily rejected by
 referees. In other words, the power is in our hands, not the NSF's.


Re: [ccp4bb] ctruncate bug?

2013-07-24 Thread Douglas Theobald
Hi Randy,

So I've been playing around with equations myself, and I have some alternative 
results.  

As I understand your Mathematica stuff, you are using the data model:

ip = ij + ib'

ib   (measured separately)

where ip is the measured peak (before any background correction), and ij is a 
random sample from the true intensity j.  Here ib is the measured background, 
whereas ib' is the background absorbed into ip.  Both ib and ib' are a random 
sample from background jb.  Again, only ip and ib are observed; ij and ib' are 
hidden variables.  

Now let me recap your treatment of that model (hopefully I get this right).

You assume Poisson distributions for ip, ij, ib, and ib', and find the joint 
probability of observed ip and ib given j and jb, p(ip,ib|j,jb).  You can 
consider ip and ib as statistically independent, since ip depends on ib', not 
ib.  You then marginalize over jb (the true background intensity) using a flat 
uninformative prior, giving p(ip,ib|j).  You find that p(ip,ib|j) is similar to 
FW's p(ip-ib|j, sdj), where sdj=sqrt(ip+ib).  

Some sort of scaling is necessary, since in practice ib and ip are counted from 
different numbers of pixels.  You find that, for roughly equal scaling, the 
Poisson version is similar to FW's Gaussian approximation for even moderate 
counts.

However, in practice, we measure the background from a much larger area than 
the spot.  For example, in the mosflm window I have open now, the background 
area is > 20 times the spot area, for high res, low SNR spots.  Similarly, in 
xds the background-to-spot ratio, in terms of pixel #, is > 10 on average and > 
5 for the great majority of spots.  Therefore, we typically know the value of 
jb to a much better precision than what we can get from ip (which is 
essentially an estimate of j+jb).  

If the relative sd of the background is about 2 or 3 times less than that of 
the spot ip, we can approximate the background estimate of jb as a constant 
(ie, ignore the uncertainty in its value).  This will be valid if the total 
area used for the background measurement is roughly 5 times the area of the 
spot (even less for negative peaks).  So what we can do is estimate jb using 
ib, and then find the conditional distribution of j given ip and jb.  Using 
your notation, this distribution is given by:

p(j|ip,jb) = exp(-(jb+j)) (jb+j)^ip / Gamma(ip+1,jb)

where Gamma(.,.) is the upper incomplete gamma function.  

The moments of this distribution have nice analytical forms (well, at least as 
nice as FW's).  Here's a table comparing the FW estimates to this Poisson 
treatment, using Randy's ip and jb values, plus some others:

  ip    jb   Exp[j]_fw  SD[j]_fw    h    Exp[j]_dt  SD[j]_dt  %diff
 ----  ----  ---------  --------  -----  ---------  --------  -----
   55    45    11.3        6.3      1.3     11.9       6.8      5.3
   45    55     3.0        2.6     -1.5      3.7       3.3      5.4
   35    65     1.1        1.1     -5.1      2.0       2.0     86
    6    10     1.0        0.91    -1.6      1.8       1.7     80
    1     3     0.37       0.34    -2.0      1.3       1.2    240
    4    12     0.45       0.43    -4.0      1.4       1.3    210

  100   100     8.0        6.0      0        8.6       6.6      7.4
   85   100     3.9        3.4     -1.6      4.7       4.2     20
   75   100     2.5        2.4     -2.9      3.4       3.2     35
  500   500    17.8       13.5      0       18.4      14.0      3.3
  440   500     6.2        5.8     -2.9      7.0       6.6     14
 1000  1000    25.2       19.1      0       25.8      21        2.3
  920  1000     9.4        8.8     -2.6     10.3       9.5      9.1
  940  1000    11.6       10.5     -2.0     12.4      11        7

In this table I've used sdj=sqrt(ip) for FW, since I'm ignoring the 
uncertainty in jb --- Randy used sqrt(ip+ib).  

h = (ip-jb)/sdj  

%diff = (Exp[j]_dt - Exp[j]_fw)/Exp[j]_fw  

Here jb is the # background counts normalized to have the same pixel area as 
ip.  
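
If anyone wants to check these numbers, here is a minimal Python/scipy sketch (the function names are mine, nothing from ctruncate or FW) that computes E[j]_dt and SD[j]_dt directly from the posterior p(j|ip,jb) above.  Writing the moments as ratios of the regularized upper incomplete gamma Q(s,x) = Gamma(s,x)/Gamma(s) keeps it from overflowing at large ip:

from math import sqrt
from scipy.special import gammaincc   # regularized upper incomplete gamma Q(s, x)

def dt_moments(ip, jb):
    """Mean and SD of j under p(j|ip,jb) = exp(-(jb+j)) (jb+j)^ip / Gamma(ip+1, jb)."""
    q1, q2, q3 = (gammaincc(ip + k, jb) for k in (1, 2, 3))
    mean = (ip + 1) * q2 / q1 - jb                          # E[j]_dt
    second = (ip + 1) * (ip + 2) * q3 / q1 - 2 * jb * (ip + 1) * q2 / q1 + jb**2
    return mean, sqrt(second - mean**2)                     # E[j]_dt, SD[j]_dt

for ip, jb in [(55, 45), (1, 3), (75, 100), (920, 1000)]:
    m, s = dt_moments(ip, jb)
    print(f"ip={ip:5d}  jb={jb:5d}  E[j]_dt={m:7.2f}  SD[j]_dt={s:6.2f}")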

Whether these would be considered important differences, I'm not sure.  The 
differences are greatest when ip < jb (that is, for negative intensities).


As an aside:

It's easy to expand this to include the acentric Wilson prior:

p(j|ip,jb,w) = exp(-(jb+j)(w+1)) (jb+j)^ip (w+1)^(ip+1) / Gamma(ip+1,jb(w+1))

where w = 1/sigma_w, sigma_w = the Wilson sigma.  Again, the moments have 
analytical forms.  
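
The same incomplete-gamma trick covers this case too.  A sketch only (same caveats as the snippet above; w follows the definition in the text, and w = 0 recovers the flat-prior numbers):

from math import sqrt
from scipy.special import gammaincc   # regularized upper incomplete gamma Q(s, x)

def dt_moments_wilson(ip, jb, w=0.0):
    """Mean and SD of j under p(j|ip,jb,w) above, with w = 1/sigma_w (w = 0 is the flat prior)."""
    c = jb * (w + 1.0)
    q1, q2, q3 = (gammaincc(ip + k, c) for k in (1, 2, 3))
    mean = (ip + 1) * q2 / ((w + 1.0) * q1) - jb
    second = ((ip + 1) * (ip + 2) * q3 / ((w + 1.0)**2 * q1)
              - 2.0 * jb * (ip + 1) * q2 / ((w + 1.0) * q1) + jb**2)
    return mean, sqrt(second - mean**2)

print(dt_moments_wilson(55, 45, w=0.0))   # ~ (11.9, 6.8), as in the table above
print(dt_moments_wilson(55, 45, w=0.1))   # a stronger Wilson prior pulls the estimate down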



On Jul 1, 2013, at 5:47 AM, Randy Read rj...@cam.ac.uk wrote:

 Hi,
 
 I've been following this discussion, and I was particularly interested by the 
 suggestion that some information might be lost by turning the separate peak 
 and background measurements into a single difference.  I accept the point 
 that there might be value in, e.g., TDS models that pay explicit attention to 
 non-Bragg intensities, but this whole discussion started from the point of 
 what estimates to use for diffracted Bragg intensities in processes such as 
 molecular replacement, refinement, and map calculations.
 
 I thought I'd run this past the two of you, in case I've 

Re: [ccp4bb] ctruncate bug?

2013-07-08 Thread Douglas Theobald
On Jul 7, 2013, at 1:44 PM, Ian Tickle ianj...@gmail.com wrote:

 On 29 June 2013 01:13, Douglas Theobald dtheob...@brandeis.edu
 wrote:
 
  I admittedly don't understand TDS well.  But I thought it was
  generally assumed that TDS contributes rather little to the
  conventional background measurement outside of the spot (so Stout
  and Jensen tells me :).  So I was not even really considering TDS,
  which I see as a different problem from measuring background (am I
  mistaken here?).  I thought the background we measure (in the area
  surrounding the spot) mostly came from diffuse solvent scatter, air
  scatter, loop scatter, etc.  If so, then we can just consider Itrue
  = Ibragg + Itds, and worry about modeling the different components
  of Itrue at a different stage.  And then it would make sense to
  think about blocking a reflection (say, with a minuscule, precisely
  positioned beam stop very near the crystal) and measuring the
  background in the spot where the reflection would hit.  That
  background should be approximated pretty well by Iback, the
  background around the spot (especially if we move far enough away
  from the spot so that TDS is negligible there).
 
 Stout & Jensen would not be my first choice to learn about TDS!  It's
 a textbook of small-molecule crystallography (I know, it was my main
 textbook during my doctorate on small-molecule structures), and small
 molecules are generally more highly ordered than macromolecules and
 therefore exhibit TDS on a much smaller scale (there are exceptions of
 course).  I think what you are talking about is acoustic mode TDS
 (so-called because of its relationship with sound transmission through
 a crystal), which peaks under the Bragg spots and is therefore very
 hard to distinguish from it.  The other two contributors to TDS that
 are often observed in MX are optic mode and Einstein model.  TDS
 arises from correlated motions within the crystal, for acoustic mode
 it's correlated motions of whole unit cells within the lattice, for
 optic mode it's correlations of different parts of a unit cell (e.g.
 correlated domain motions in a protein), and for Einstein model it's
 correlations of the movement of electrons as they are carried along by
 vibrating atoms (an Einstein solid is a simple model of a crystal
 proposed by A. Einstein consisting of a collection of independent
 quantised harmonic-isotropic oscillators; I doubt he was aware of its
 relevance to TDS, that came later).  Here's an example of TDS:
 http://people.cryst.bbk.ac.uk/~tickle/iucr99/tds2f.gif .  The acoustic
 mode gives the haloes around the Bragg spots (but as I said mainly
 coincides with the spots), the optic mode gives the nebulous blobs,
 wisps and streaks that are uncorrelated with the Bragg spots (you can
 make out an inner ring of 14 blobs due to the 7-fold NCS), and the
 Einstein model gives the isotropic uniform greying increasing towards
 the outer edge (makes it look like the diffraction pattern has been
 projected onto a sphere).  So I leave you to decide whether TDS
 contributes to the background!

That's all very interesting --- do you have a good ref for TDS where I
can read up on the theory/practice?  My protein xtallography books say
even less than S&J about TDS.  Anyway, this appears to be a problem
beyond the scope of this present discussion --- in an ideal world we'd
be modeling all the forms of TDS, and Bragg diffraction, and comparing
those predictions to the intensity pattern over the entire detector ---
not just integrating near the reciprocal lattice points.  Going on what
you said above, it seems the acoustic component can't really be measured
independently of the Bragg peak, while the optic and Einstein components
can, or at least can be estimated pretty well from the intensity around the
Bragg peak (which means we can treat it as background).  In any case,
I'm going to ignore the TDS complications for now. :)

 As for the blocking beam stop, every part of the crystal (or at least
 every part that's in the beam) contributes to every part of the
 diffraction pattern (i.e. Fourier transform).  This means that your
 beam stop would have to mask the whole crystal - any small bit of the
 crystal left unmasked and exposed to the beam would give a complete
 diffraction pattern!  That means you wouldn't see anything, not even
 the background!  

That's all true, but you can detect peaks independently of one another
on a detector, so obviously there is some minimal distance away from a
crystal where you could completely block any given reflection and
nothing else. Clearly the reflection stop would have to be the size of
the crystal (or at least the beam).

 You could leave a small hole in the centre for the direct beam and
 that would give you the air scatter contribution, but usually the air
 path is minimal anyway so that's only a very small contribution to the
 total background.  But let's say by some magic you were able to
 measure only the background, say

Re: [ccp4bb] ctruncate bug?

2013-06-28 Thread Douglas Theobald
On Jun 27, 2013, at 12:30 PM, Ian Tickle ianj...@gmail.com wrote:

 On 22 June 2013 19:39, Douglas Theobald dtheob...@brandeis.edu wrote:
 
 So I'm no detector expert by any means, but I have been assured by those who 
 are that there are non-Poissonian sources of noise --- I believe mostly in 
 the readout, when photon counts get amplified.  Of course this will depend 
 on the exact type of detector, maybe the newest have only Poisson noise.
 
 Sorry for delay in responding, I've been thinking about it.  It's indeed 
 possible that the older detectors had non-Poissonian noise as you say, but 
 AFAIK all detectors return _unsigned_ integers (unless possibly the number is 
 to be interpreted as a flag to indicate some error condition, but then 
 obviously you wouldn't interpret it as a count).  So whatever the detector 
 AFAIK it's physically impossible for it to return a negative number that is 
 to be interpreted as a photon count (of course the integration program may 
 interpret the count as a _signed_ integer but that's purely a technical 
 software issue).  

Just because the detectors spit out positive numbers (unsigned ints) does not 
mean that those values are Poisson distributed.  As I understand it, the 
readout can introduce non-Poisson noise, which is usually modeled as Gaussian.  

 I think we're all at least agreed that, whatever the true distribution of 
 Ispot (and Iback) is, it's not in general Gaussian, except as an 
 approximation in the limit of large Ispot and Iback (with the proviso that 
 under this approximation Ispot  Iback can never be negative).  Certainly the 
 assumption (again AFAIK) has always been that var(count) = count and I think 
 I'm right in saying that only a Poisson distribution has that property?

I think you mean that the Poisson has the property that mean(x) = var(x) (and 
since the ML estimate of the mean = count, you get your equation).  Many other 
distributions can approximate that (most of the binomial variants with small 
p).  Also, the standard gamma distribution with scale parameter=1 has that 
exact property.  

 No, it's just terminology.  For you, Iobs is defined as Ispot-Iback, and 
 that's fine.  (As an aside, assuming the Poisson model, this Iobs will have 
 a Skellam distribution, which can take negative values and asymptotically 
 approaches a Gaussian.)  The photons contributed to Ispot from Itrue will 
 still be Poisson.  Let's call them something besides Iobs, how about Ireal?  
 Then, the Poisson model is
 
 Ispot = Ireal + Iback'
 
 where Ireal comes from a Poisson with mean Itrue, and Iback' comes from a 
 Poisson with mean Iback_true.  The same likelihood function follows, as well 
 as the same points.  You're correct that we can't directly estimate Iback', 
 but I assume that Iback (the counts around the spot) come from the same 
 Poisson with mean Iback_true (as usual).  
 
 So I would say, sure, you have defined Iobs, and it has a Skellam 
 distribution, but what, if anything, does that Iobs have to do with Itrue?  
 My point still holds, that your Iobs is not a valid estimate of Itrue when 
 Ispot < Iback.  Iobs as an estimate of Itrue requires unphysical assumptions, 
 namely that photon counts can be negative.  It is impossible to derive 
 Ispot-Iback as an estimate for Itrue (when Ispot < Iback) *unless* you make 
 that unphysical assumption (like the Gaussian model).
 
 Please note that I have never claimed that Iobs = Ispot - Iback is to be 
 interpreted as an estimate of Itrue, indeed quite the opposite: I agree 
 completely that Iobs has little to do with Itrue when Iobs is negative.  In 
 fact I don't believe anyone else is claiming that Iobs is to be interpreted 
 as an estimate of Itrue either, so maybe this is the source of the 
 misunderstanding?  

Maybe it is, but that has its own problems.  I imagine that most people who 
collect an X-ray dataset think that the intensities in their mtz are indeed 
estimates of the true intensities from their crystal.  Seems like a reasonable 
thing to expect, especially since the fourier of our model is supposed to 
predict Itrue.  If Iobs is not an estimate of Itrue, what exactly is its 
relevance to the structure inference problem?  Maybe it only serves as a 
way-station on the road to the French-Wilson correction?  As I understand it, 
not everyone uses ctruncate.  

 Certainly for me Ispot - Iback is merely the difference between the two 
 measurements, nothing more.  Maybe if we called it something other than Iobs 
 (say Idiff), or even avoided giving it a name altogether that would avoid any 
 further confusion?  Perhaps this whole discussion has been merely about 
 terminology?
  
 I'm also puzzled as to your claim that Iback' is not Poisson.  I don't think 
 your QM argument is relevant, since we can imagine what we would have 
 detected at the spot if we'd blocked the reflection, and that # of photon 
 counts would be Poisson.  That is precisely the conventional logic behind

Re: [ccp4bb] ctruncate bug?

2013-06-22 Thread Douglas Theobald
Ian, I really do think we are almost saying the same thing.  Let me try to
clarify.

You say that the Gaussian model is not the correct data model, and that
the Poisson is correct.  I more-or-less agree.  If I were being pedantic
(me?) I would say that the Poisson is *more* physically realistic than the
Gaussian, and more realistic in a very important and relevant way --- but
in truth the Poisson model does not account for other physical sources of
error that arise from real crystals and real detectors, such as dark noise
and read noise (that's why I would prefer a gamma distribution).  I also
agree that for x > 10 the Gaussian is a good approximation to the Poisson.  I
basically agree with every point you make about the Poisson vs the
Gaussian, except for the following.

The Iobs=Ispot-Iback equation cannot be derived from a Poisson assumption,
except as an approximation when Ispot >> Iback.  It *can* be derived from
the Gaussian assumption (and in fact I think that is probably the *only*
justification it has).   It is true that the difference between two
Poissons can be negative.  It is also true that for moderate # of counts,
the Gaussian is a good approximation to the Poisson.  But we are trying to
estimate Itrue, and both of those points are irrelevant to estimating Itrue
when Ispot < Iback.  Contrary to your assertion, we are not concerned with
differences of Poissonians, only sums.  Here is why:

In the Poisson model you outline, Ispot is the sum of two Poisson
variables, Iback and Iobs.  That means Ispot is also Poisson and can never
be negative.  Again --- the observed data (Ispot) is a *sum*, so that is
what we must deal with.  The likelihood function for this model is:

L(a) = (a+b)^k exp(-a-b)

where 'k' is the # of counts in Ispot, 'a' is the mean of the Iobs Poisson
(i.e., a = Itrue), and 'b' is the mean of the Iback Poisson.  Of course
k >= 0, and both parameters a >= 0 and b >= 0.  Our job is to estimate 'a', Itrue.
 Given the likelihood function above, there is no valid estimate of 'a'
that will give a negative value.  For example, the ML estimate of 'a' is
always non-negative.  Specifically, if we assume 'b' is known from
background extrapolation, the ML estimate of 'a' is:

a = k-b   if k > b

a = 0   if k <= b

You can verify this visually by plotting the likelihood function (vs 'a' as
variable) for any combination of k and b you want.  The SD is a bit more
difficult, but it is approximately (a+b)/sqrt(k), where 'a' is now the ML
estimate of 'a'.
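
If you'd rather not plot it by hand, here is a tiny numpy sketch (names are mine; it just grid-searches the likelihood above) showing that the maximizer of L(a) over a >= 0 is max(k-b, 0):

import numpy as np

def ml_itrue(k, b, a_max=200.0, n=200001):
    """Grid-maximize the Poisson log-likelihood l(a) = k*log(a+b) - a - b over a >= 0."""
    a = np.linspace(0.0, a_max, n)
    loglik = k * np.log(a + b) - a - b
    return a[np.argmax(loglik)]

for k, b in [(10, 3), (3, 10), (100, 100), (0, 5)]:
    print(f"k={k:4d}  b={b:4d}  grid ML={ml_itrue(k, b):7.3f}  max(k-b,0)={max(k - b, 0)}")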

Note that the ML estimate of 'a', when k > b (Ispot > Iback), is equivalent to
Ispot-Iback.

Now, to restate:  as an estimate of Itrue, Ispot-Iback cannot be derived
from the Poisson model.  In contrast, Ispot-Iback *can* be derived from a
Gaussian model (as the ML and LS estimate of Itrue).  In fact, I'll wager
the Gaussian is the only reasonable model that gives Ispot-Iback as an
estimate of Itrue.  This is why I claim that using Ispot-Iback as an
estimate of Itrue, even when Ispot < Iback, implicitly means you are using a
(non-physical) Gaussian model.  Feel free to prove me wrong --- can you
derive Ispot-Iback, as an estimate of Itrue, from anything besides a
Gaussian?

Cheers,

Douglas




On Sat, Jun 22, 2013 at 12:06 PM, Ian Tickle ianj...@gmail.com wrote:

 On 21 June 2013 19:45, Douglas Theobald dtheob...@brandeis.edu wrote:


 The current way of doing things is summarized by Ed's equation:
 Ispot-Iback=Iobs.  Here Ispot is the # of counts in the spot (the area
 encompassing the predicted reflection), and Iback is # of counts in the
 background (usu. some area around the spot).  Our job is to estimate the
 true intensity Itrue.  Ed and others argue that Iobs is a reasonable
 estimate of Itrue, but I say it isn't because Itrue can never be negative,
 whereas Iobs can.

 Now where does the Ispot-Iback=Iobs equation come from?  It implicitly
 assumes that both Iobs and Iback come from a Gaussian distribution, in
 which Iobs and Iback can have negative values.  Here's the implicit data
 model:

 Ispot = Iobs + Iback

 There is an Itrue, to which we add some Gaussian noise and randomly
 generate an Iobs.  To that is added some background noise, Iback, which is
 also randomly generated from a Gaussian with a true mean of Ibtrue.  This
 gives us the Ispot, the measured intensity in our spot.  Given this data
 model, Ispot will also have a Gaussian distribution, with mean equal to the
 sum of Itrue + Ibtrue.  From the properties of Gaussians, then, the ML
 estimate of Itrue will be Ispot-Iback, or Iobs.


 Douglas, sorry I still disagree with your model.  Please note that I do
 actually support your position, that Ispot-Iback is not the best estimate
 of Itrue.  I stress that I am not arguing against this conclusion, merely
 (!) with your data model, i.e. you are arriving at the correct conclusion
 despite using the wrong model!  So I think it's worth clearing that up.

 First off, I can assure you that there is no assumption, either implicit
 or explicit, that Ispot

Re: [ccp4bb] ctruncate bug?

2013-06-22 Thread Douglas Theobald
On Sat, Jun 22, 2013 at 1:04 PM, Douglas Theobald dtheob...@brandeis.edu wrote:

 Feel free to prove me wrong --- can you derive Ispot-Iback, as an estimate
 of Itrue, from anything besides a Gaussian?


OK, I'll prove myself wrong.   Ispot-Iback can be derived as an estimate of
Itrue, even when Ispot < Iback, assuming a logistic model, a Laplace model, and
probably others that allow negative values.  I doubt anyone cares about
these more exotic models; the point is that to get Ispot-Iback as an
estimate of Itrue when Ispot < Iback requires a non-physical model that
allows negative photon counts and intensities.


Re: [ccp4bb] ctruncate bug?

2013-06-22 Thread Douglas Theobald
On Sat, Jun 22, 2013 at 1:56 PM, Ian Tickle ianj...@gmail.com wrote:

 On 22 June 2013 18:04, Douglas Theobald dtheob...@brandeis.edu wrote:

  --- but in truth the Poisson model does not account for other physical
 sources of error that arise from real crystals and real detectors, such as
 dark noise and read noise (that's why I would prefer a gamma distribution).


 A photon counter is a digital device, not an analogue one.  It starts at
 zero and adds 1 every time it detects a photon (or what it thinks is a
 photon).  Once added, it is physically impossible for it to subtract 1 from
 its accumulated count: it contains no circuit to do that.  It can certainly
 miss photons, so you end up with less than you should, and it can certainly
 'see' photons where there were none (e.g. from instrumental noise), so you
 end up with more than you should.  However once a count has been
 accumulated in the digital memory it stays there until the memory is
 cleared for the next measurement, and you can never end up with less than
 that accumulated count and in particular not less than zero; the bits of
 memory where the counts are accumulated are simply not programmed to return
 negative numbers.  It has nothing to do with whether the crystal is real or
 not, all that matters is that photons from somewhere are arriving at and
 being counted by the detector.  The accumulated counts at any moment in
 time have a Poisson distribution since the photons arrive completely
 randomly in time.


I might add that if you are correct --- that the naive Poisson model is
appropriate (perhaps true for the latest and greatest detectors, evidently
Pilatus has no read-out noise or dark current) --- then the ML solution I
outlined is a good one (much better than the crude Ispot-Iback background
subtraction), and it provides rigorous SD estimates too.


Re: [ccp4bb] ctruncate bug?

2013-06-22 Thread Douglas Theobald
On Jun 22, 2013, at 6:18 PM, Frank von Delft frank.vonde...@sgc.ox.ac.uk 
wrote:

 A fascinating discussion (I've learnt a lot!);  a quick sanity check, though: 
 
 In what scenarios would these improved estimates make a significant 
 difference?  

Who knows?  I think that improved estimates are always a good thing, 
ignoring computational complexity (by "improved" I mean making more accurate 
physical assumptions).  This may all be academic --- estimating Itrue with 
unphysical negative values, and then later correcting w/French-Wilson, may give 
approximately the same answers and make no tangible difference in the models.  
But that all seems a bit convoluted, ad hoc, and unnecessary, esp. now with the 
available computational power.  It might make a difference.  

 Or rather:  are there any existing programs (as opposed to vapourware) that 
 would benefit significantly?
 
 Cheers
 phx
 
 
 
 On 22/06/2013 18:04, Douglas Theobald wrote:
 Ian, I really do think we are almost saying the same thing.  Let me try to 
 clarify.
 
 You say that the Gaussian model is not the correct data model, and that 
 the Poisson is correct.  I more-or-less agree.  If I were being pedantic 
 (me?) I would say that the Poisson is *more* physically realistic than the 
 Gaussian, and more realistic in a very important and relevant way --- but in 
 truth the Poisson model does not account for other physical sources of error 
 that arise from real crystals and real detectors, such as dark noise and 
 read noise (that's why I would prefer a gamma distribution).  I also agree 
 that for x > 10 the Gaussian is a good approximation to the Poisson.  I 
 basically agree with every point you make about the Poisson vs the Gaussian, 
 except for the following.
 
 The Iobs=Ispot-Iback equation cannot be derived from a Poisson assumption, 
 except as an approximation when Ispot >> Iback.  It *can* be derived from 
 the Gaussian assumption (and in fact I think that is probably the *only* 
 justification it has).   It is true that the difference between two Poissons 
 can be negative.  It is also true that for moderate # of counts, the 
 Gaussian is a good approximation to the Poisson.  But we are trying to 
 estimate Itrue, and both of those points are irrelevant to estimating Itrue 
 when Ispot < Iback.  Contrary to your assertion, we are not concerned with 
 differences of Poissonians, only sums.  Here is why:
 
 In the Poisson model you outline, Ispot is the sum of two Poisson variables, 
 Iback and Iobs.  That means Ispot is also Poisson and can never be negative. 
  Again --- the observed data (Ispot) is a *sum*, so that is what we must 
 deal with.  The likelihood function for this model is:
 
 L(a) = (a+b)^k exp(-a-b)
 
 where 'k' is the # of counts in Ispot, 'a' is the mean of the Iobs Poisson 
 (i.e., a = Itrue), and 'b' is the mean of the Iback Poisson.  Of 
 course k >= 0, and both parameters a >= 0 and b >= 0.  Our job is to estimate 'a', 
 Itrue.  Given the likelihood function above, there is no valid estimate of 
 'a' that will give a negative value.  For example, the ML estimate of 'a' is 
 always non-negative.  Specifically, if we assume 'b' is known from 
 background extrapolation, the ML estimate of 'a' is:
 
 a = k-b   if k > b
 
 a = 0   if k <= b
 
 You can verify this visually by plotting the likelihood function (vs 'a' as 
 variable) for any combination of k and b you want.  The SD is a bit more 
 difficult, but it is approximately (a+b)/sqrt(k), where 'a' is now the ML 
 estimate of 'a'.  
 
 Note that the ML estimate of 'a', when k > b (Ispot > Iback), is equivalent to 
 Ispot-Iback.  
 
 Now, to restate:  as an estimate of Itrue, Ispot-Iback cannot be derived 
 from the Poisson model.  In contrast, Ispot-Iback *can* be derived from a 
 Gaussian model (as the ML and LS estimate of Itrue).  In fact, I'll wager 
 the Gaussian is the only reasonable model that gives Ispot-Iback as an 
 estimate of Itrue.  This is why I claim that using Ispot-Iback as an 
 estimate of Itrue, even when Ispot < Iback, implicitly means you are using a 
 (non-physical) Gaussian model.  Feel free to prove me wrong --- can you 
 derive Ispot-Iback, as an estimate of Itrue, from anything besides a 
 Gaussian?
 
 Cheers,
 
 Douglas
 
 
 
 
 On Sat, Jun 22, 2013 at 12:06 PM, Ian Tickle ianj...@gmail.com wrote:
 On 21 June 2013 19:45, Douglas Theobald dtheob...@brandeis.edu wrote:
 
 The current way of doing things is summarized by Ed's equation: 
 Ispot-Iback=Iobs.  Here Ispot is the # of counts in the spot (the area 
 encompassing the predicted reflection), and Iback is # of counts in the 
 background (usu. some area around the spot).  Our job is to estimate the 
 true intensity Itrue.  Ed and others argue that Iobs is a reasonable 
 estimate of Itrue, but I say it isn't because Itrue can never be negative, 
 whereas Iobs can.
 
 Now where does the Ispot-Iback=Iobs equation come from?  It implicitly 
 assumes that both Iobs and Iback come from a Gaussian

Re: [ccp4bb] ctruncate bug?

2013-06-21 Thread Douglas Theobald
On Jun 21, 2013, at 8:36 AM, Ed Pozharski epozh...@umaryland.edu wrote:

 On 06/20/2013 01:07 PM, Douglas Theobald wrote:
 How can there be nothing wrong with something that is unphysical?  
 Intensities cannot be negative.
 
 I think you are confusing two things - the true intensities and observed 
 intensities.

But I'm not.  Let me try to convince you ...

 True intensities represent the number of photons that diffract off a crystal 
 in a specific direction or, for QED-minded, relative probabilities of a 
 single photon being found in a particular area of the detector when it's 
 probability wave function finally collapses.

I agree. 

 True intensities certainly cannot be negative and in crystallographic method 
 they never are. They are represented by the best theoretical estimates 
 possible, Icalc.  These are always positive.

I also very much agree.  

 Observed intensities are the best estimates that we can come up with in an 
 experiment.  

I also agree with this, and this is the clincher.  You are arguing that 
Ispot-Iback=Iobs is the best estimate we can come up with.  I claim that is 
absurd.  How are you quantifying "best"?  Usually we have some sort of 
discrepancy measure between true and estimate, like RMSD, mean absolute 
distance, log distance, or somesuch.  Here is the important point --- by any 
measure of discrepancy you care to use, the person who estimates Iobs as 0 when 
Iback > Ispot will *always*, in *every case*, beat the person who estimates Iobs 
with a negative value.   This is an indisputable fact.  

 These are determined by integrating pixels around the spot where particular 
 reflection is expected to hit the detector.  Unfortunately, science did not 
 yet invent a method that would allow to suspend a crystal in vacuum while 
 also removing all of the outside solvent.  Neither we have included diffuse 
 scatter in our theoretical model.  Because of that, full reflection intensity 
 contains background signal in addition to the Icalc.  This background has to 
 be subtracted and what is perhaps the most useful form of observation is 
 Ispot-Iback=Iobs.

How can that be the most useful form, when 0 is always a better estimate than a 
negative value, by any criterion?

 These observed intensities can be negative because while their true 
 underlying value is positive, random errors may result in Iback > Ispot.  There 
 is absolutely nothing unphysical here.

Yes there is.  The only way you can get a negative estimate is to make 
unphysical assumptions.  Namely, the estimate Ispot-Iback=Iobs assumes that 
both the true value of I and the background noise come from a Gaussian 
distribution that is allowed to have negative values.  Both of those 
assumptions are unphysical.  

 Replacing Iobs with E(J) is not only unnecessary, it's ill-advised as it will 
 distort intensity statistics.  For example, let's say you have translational 
 NCS aligned with crystallographic axes, and hence some set of reflections is 
 systematically absent.  If all is well, Iobs~0 for the subset while E(J) 
 is systematically positive.  This obviously happens because the standard 
 Wilson prior is wrong for these reflections, but I digress, as usual.
 
 In summary, there is indeed nothing wrong, imho, with negative Iobs.  The 
 fact that some of these may become negative is correctly accounted for once 
 sigI is factored into the ML target.
 
 Cheers,
 
 Ed.
 
 -- 
 Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs
 


Re: [ccp4bb] ctruncate bug?

2013-06-21 Thread Douglas Theobald
I kinda think we're saying the same thing, sort of.

You don't like the Gaussian assumption, and neither do I.  If you make the 
reasonable Poisson assumptions, then you don't get the Ispot-Iback=Iobs for the 
best estimate of Itrue.  Except as an approximation for large values, but we 
are talking about the case when Iback > Ispot, where the Gaussian approximation 
to the Poisson no longer holds.  The sum of two Poisson variates is also 
Poisson, which also can never be negative, unlike the Gaussian.  

So I reiterate: the Ispot-Iback=Iobs equation assumes Gaussians and hence 
negativity.  The Ispot-Iback=Iobs does not follow from a Poisson assumption.  


On Jun 21, 2013, at 1:13 PM, Ian Tickle ianj...@gmail.com wrote:

 On 21 June 2013 17:10, Douglas Theobald dtheob...@brandeis.edu wrote:
 Yes there is.  The only way you can get a negative estimate is to make 
 unphysical assumptions.  Namely, the estimate Ispot-Iback=Iobs assumes that 
 both the true value of I and the background noise come from a Gaussian 
 distribution that is allowed to have negative values.  Both of those 
 assumptions are unphysical.
 
 Actually that's not correct: Ispot and Iback are both assumed to come from a 
 _Poisson_ distribution which by definition is zero for negative values of its 
 argument (you can't have a negative number of photons), so are _not_ allowed 
 to have negative values.  For large values of the argument (in fact the 
 approximation is pretty good even for x ~ 10) a Poisson approximates to a 
 Gaussian, and then of course the difference Ispot-Iback is also approximately 
 Gaussian.
 
 But I think that doesn't affect your argument.
 
 Cheers
 
 -- Ian 


Re: [ccp4bb] ctruncate bug?

2013-06-21 Thread Douglas Theobald
On Jun 20, 2013, at 2:13 PM, Ian Tickle ianj...@gmail.com wrote:

 Douglas, I think you are missing the point that estimation of the parameters 
 of the proper Bayesian statistical model (i.e. the Wilson prior) in order to 
 perform the integration in the manner you are suggesting requires knowledge 
 of the already integrated intensities!  

Well, that's true, but that's how FW do it.  They allow for negative integrated 
intensities.  I'm arguing that we should not do that, since true intensities 
are positive, and so any estimate of them should also be positive.  

Examples are always better than words, so here goes (and I apologize for the 
length):

The current way of doing things is summarized by Ed's equation: 
Ispot-Iback=Iobs.  Here Ispot is the # of counts in the spot (the area 
encompassing the predicted reflection), and Iback is # of counts in the 
background (usu. some area around the spot).  Our job is to estimate the true 
intensity Itrue.  Ed and others argue that Iobs is a reasonable estimate of 
Itrue, but I say it isn't because Itrue can never be negative, whereas Iobs 
can.  

Now where does the Ispot-Iback=Iobs equation come from?  It implicitly assumes 
that both Iobs and Iback come from a Gaussian distribution, in which Iobs and 
Iback can have negative values.  Here's the implicit data model:

Ispot = Iobs + Iback

There is an Itrue, to which we add some Gaussian noise and randomly generate an 
Iobs.  To that is added some background noise, Iback, which is also randomly 
generated from a Gaussian with a true mean of Ibtrue.  This gives us the 
Ispot, the measured intensity in our spot.  Given this data model, Ispot will 
also have a Gaussian distribution, with mean equal to the sum of Itrue + 
Ibtrue.  From the properties of Gaussians, then, the ML estimate of Itrue will 
be Ispot-Iback, or Iobs.  

Now maybe you disagree with that Gaussian data model.  If so, welcome to my 
POV.  

There are better models, ones that don't give Ispot-Iback as our best estimate 
of Itrue.

Here is a simple example that incorporates our knowledge that Itrue cannot be 
negative (this example is primarily for illustrating the point, it's not 
exactly what I would recommend).  Instead of using Gaussians, we will use Gamma 
distributions, which cannot be negative.  

We assume Iobs is distributed according to a Gamma(Itrue,1).  The mean of this 
distribution is Itrue.  (The Maxwell-Boltzmann energy distribution is also a 
gamma, just for comparison).  

We also assume that the noise is exponential (a special case of the gamma), 
Gamma(1,1).  The mean of this distribution is 1.  (You could imagine that 
you've normalized Ispot relative to its background --- again, just for ease of 
calculation).  

We still assume that Ispot = Iobs + Iback.  Then, Ispot will also have a gamma 
distribution, Gamma(Itrue+1,1).  The mean of the Ispot distribution, as you 
might expect, is Itrue+1.  

Now we measure Ispot.  Given Ispot, the ML estimate of Itrue is:

InvDiGamma[ln(Ispot)]-1   if Ispot > 0.561
or
0   if Ispot <= 0.561

Note, the ML estimate is no longer Iobs, and the ML estimate cannot be 
negative.  InvDiGamma is the inverse Digamma function --- a bit unusual, but 
easily calculated (actually no weirder than the exponential or logarithm, it's a 
relative of factorial and the gamma function).  Not something the Braggs 
would've used, but hey, we've got iPhones now.  We can also estimate the SD of 
of our estimate, but I won't bore you with the equation.

A few examples: 

Ispot   ML Itrue   SD
-----   --------   ----
  0.5      0       0.78
  0.6      0.04    0.80
  0.8      0.25    0.91
  0.9      0.36    0.97
  1.0      0.46    1.0
  1.5      0.97    1.2
  2.0      1.48    1.4
  3.0      2.49    1.7
  5.0      4.49    2.2
 10.0      9.50    3.2
 20.0     19.5     4.5
100       99.5    10

Note that the first four entries in the table are the case when Ispot < Iback.  
No negative estimates.  You'd get qualitatively similar results if you assume 
Poisson for Iback and Ispot.  
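
For anyone who wants to reproduce that table, here is a short Python/scipy sketch of the gamma-model ML estimate.  Scipy has no ready-made inverse digamma, so I just invert digamma numerically; the function name and bracketing interval are mine:

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def ml_itrue_gamma(ispot):
    """ML estimate of Itrue when Ispot ~ Gamma(Itrue + 1, 1), constrained to Itrue >= 0."""
    if ispot <= np.exp(digamma(1.0)):                 # exp(digamma(1)) ~ 0.561
        return 0.0
    # Solve digamma(Itrue + 1) = ln(Ispot) for Itrue, i.e. InvDiGamma[ln(Ispot)] - 1.
    f = lambda t: digamma(t + 1.0) - np.log(ispot)
    return brentq(f, 0.0, max(10.0 * ispot, 10.0))

for ispot in [0.5, 0.6, 1.0, 2.0, 10.0, 100.0]:
    print(f"Ispot={ispot:6.1f}  ML Itrue={ml_itrue_gamma(ispot):7.2f}")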

To sum up --- the equation Ispot-Iback=Iobs is unphysical because it is founded 
on unphysical assumptions.  If you make better physical assumptions (i.e., 
Itrue cannot be negative), you end up with different estimates for Itrue.  


 I suppose we could iterate, i.e. assume an approximate prior, integrate, 
 calculate a better prior, re-do the integration with the new prior and so on 
 (hoping of course that the whole process converges), but I think most people 
 would regard that as overkill.  Also dealing with the issue of averaging 
 estimates of intensities that no longer have a Gaussian error distribution, 
 and also crucially outlier rejection, would require some rethinking of the 
 algorithms. The question is would it make any difference in the end compared 
 with the 'post-correction' we're doing now?
 
 Cheers
 
 -- Ian
 
 
 On 20 June 2013 18:14, Douglas Theobald dtheob...@brandeis.edu wrote:
 I still don't see how you get a negative

Re: [ccp4bb] ctruncate bug?

2013-06-21 Thread Douglas Theobald
On Jun 21, 2013, at 2:48 PM, Ed Pozharski epozh...@umaryland.edu wrote:

 Douglas,
 Observed intensities are the best estimates that we can come up with in an 
 experiment.
 I also agree with this, and this is the clincher.  You are arguing that 
 Ispot-Iback=Iobs is the best estimate we can come up with.  I claim that is 
 absurd.  How are you quantifying "best"?  Usually we have some sort of 
 discrepancy measure between true and estimate, like RMSD, mean absolute 
 distance, log distance, or somesuch.  Here is the important point --- by any 
 measure of discrepancy you care to use, the person who estimates Iobs as 0 
 when Iback > Ispot will *always*, in *every case*, beat the person who 
 estimates Iobs with a negative value.   This is an indisputable fact.
 
 First off, you may find it useful to avoid such words as "absurd" and 
 "indisputable fact".  I know political correctness may be sometimes overrated, 
 but if you actually plan to have meaningful discussion, let's assume that 
 everyone responding to your posts is just trying to help figure this out.

I apologize for offending and using the strong words --- my intention was not 
to offend.  This is just how I talk when brainstorming with my colleagues 
around a blackboard, but of course then you can see that I smile when I say it. 
 

 To address your point, you are right that J=0 is closer to true intensity 
 then a negative value.  The problem is that we are not after a single 
 intensity, but rather all of them, as they all contribute to electron density 
 reconstruction.  If you replace negative Iobs with E(J), you would 
 systematically inflate the averages, which may turn problematic in some 
 cases.  

So, I get the point.  But even then, using any reasonable criterion, the whole 
estimated dataset will be closer to the true data if you set all negative 
intensity estimates to 0.  

 It is probably better to stick with raw intensities and construct 
 theoretical predictions properly to account for their properties.
 
 What I was trying to tell you is that observed intensities are what we get 
 from experiment.  

But they are not what you get from the detector.  The detector spits out a 
positive value for what's inside the spot.  It is we, as human agents, who 
later manipulate and massage that data value by subtracting the background 
estimate.  A value that has been subjected to a crude background subtraction is 
not the raw experimental value.  It has been modified, and there must be some 
logic to why we massage the data in that particular manner.  I agree, of 
course, that the background should be accounted for somehow.  But why just 
subtract it away?  There are other ways to massage the data --- see my other 
post to Ian.  My argument is that however we massage the experimentally 
observed value should be physically informed, and allowing negative intensity 
estimates violates the basic physics.  

[snip]

 These observed intensities can be negative because while their true 
 underlying value is positive, random errors may result in Iback > Ispot.  There 
 There is absolutely nothing unphysical here.
 Yes there is.  The only way you can get a negative estimate is to make 
 unphysical assumptions.  Namely, the estimate Ispot-Iback=Iobs assumes that 
 both the true value of I and the background noise come from a Gaussian 
 distribution that is allowed to have negative values.  Both of those 
 assumptions are unphysical.
 
 See, I have a problem with this.  Both common sense and laws of physics 
 dictate that number of photons hitting spot on a detector is a positive 
 number.  There is no law of physics that dictates that under no circumstances 
 there could be Ispot < Iback.  

That's not what I'm saying.  Sure, Ispot can be less than Iback randomly.  That 
does not mean we have to estimate the detected intensity as negative, after 
accounting for background.

 Yes, E(Ispot) >= E(Iback).  Yes, E(Ispot-Iback) >= 0.  But P(Ispot-Iback <= 0) > 0, 
 and therefore experimental sampling of Ispot-Iback is bound to occasionally 
 produce negative values.  What law of physics is broken when for a given 
 reflection the total number of photons in spot pixels is less than the total number 
 of photons in equal number of pixels in the surrounding background mask?
 
 Cheers,
 
 Ed.
 
 -- 
 Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs


Re: [ccp4bb] ctruncate bug?

2013-06-21 Thread Douglas Theobald
On Jun 21, 2013, at 2:52 PM, James Holton jmhol...@lbl.gov wrote:

 Yes, but the DIFFERENCE between two Poisson-distributed values can be 
 negative.  This is, unfortunately, what you get when you subtract the 
 background out from under a spot.  Perhaps this is the source of confusion 
 here?

Maybe, but if you assume Poisson background and intensities, the ML estimate 
when background > measured intensity is not negative, nor is it the difference 
Ispot-Iback.  The ML estimate is 0.  (With a finite non-zero SD; the smaller the 
Ispot/Iback ratio, the smaller the SD.)

 On Fri, Jun 21, 2013 at 11:34 AM, Douglas Theobald dtheob...@brandeis.edu 
 wrote:
 I kinda think we're saying the same thing, sort of.
 
 You don't like the Gaussian assumption, and neither do I.  If you make the 
 reasonable Poisson assumptions, then you don't get the Ispot-Iback=Iobs for 
 the best estimate of Itrue.  Except as an approximation for large values, but 
 we are talking about the case when Iback > Ispot, where the Gaussian 
 approximation to the Poisson no longer holds.  The sum of two Poisson 
 variates is also Poisson, which also can never be negative, unlike the 
 Gaussian.
 
 So I reiterate: the Ispot-Iback=Iobs equation assumes Gaussians and hence 
 negativity.  The Ispot-Iback=Iobs does not follow from a Poisson assumption.
 
 
 On Jun 21, 2013, at 1:13 PM, Ian Tickle ianj...@gmail.com wrote:
 
  On 21 June 2013 17:10, Douglas Theobald dtheob...@brandeis.edu wrote:
  Yes there is.  The only way you can get a negative estimate is to make 
  unphysical assumptions.  Namely, the estimate Ispot-Iback=Iobs assumes 
  that both the true value of I and the background noise come from a 
  Gaussian distribution that is allowed to have negative values.  Both of 
  those assumptions are unphysical.
 
  Actually that's not correct: Ispot and Iback are both assumed to come from 
  a _Poisson_ distribution which by definition is zero for negative values of 
  its argument (you can't have a negative number of photons), so are _not_ 
  allowed to have negative values.  For large values of the argument (in fact 
  the approximation is pretty good even for x ~ 10) a Poisson approximates to 
  a Gaussian, and then of course the difference Ispot-Iback is also 
  approximately Gaussian.
 
  But I think that doesn't affect your argument.
 
  Cheers
 
  -- Ian
 


Re: [ccp4bb] ctruncate bug?

2013-06-20 Thread Douglas Theobald
Just trying to understand the basic issues here.  How could refining directly 
against intensities solve the fundamental problem of negative intensity values?


On Jun 20, 2013, at 11:34 AM, Bernhard Rupp hofkristall...@gmail.com wrote:

 As a maybe better alternative, we should (once again) consider to refine 
 against intensities (and I guess George Sheldrick would agree here).
 
 I have a simple question - what exactly, short of some sort of historic 
 inertia (or memory lapse), is the reason NOT to refine against intensities? 
 
 Best, BR


Re: [ccp4bb] ctruncate bug?

2013-06-20 Thread Douglas Theobald
Seems to me that the negative Is should be dealt with early on, in the 
integration step.  Why exactly do integration programs report negative Is to 
begin with?


On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk wrote:

 Wouldn't it be possible to take advantage of negative Is to extrapolate/estimate 
 the decay of scattering background (kind of Wilson plot of background 
 scattering) to flat out the background and push all the Is to positive values?
 
 More of a question rather than a suggestion ...
 
 D
 
 
 
 From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Ian 
 Tickle
 Sent: 20 June 2013 17:34
 To: ccp4bb
 Subject: Re: [ccp4bb] ctruncate bug?
 
 Yes higher R factors is the usual reason people don't like I-based refinement!
 
 Anyway, refining against Is doesn't solve the problem, it only postpones it: 
 you still need the Fs for maps! (though errors in Fs may be less critical 
 then).
 -- Ian
 
 On 20 June 2013 17:20, Dale Tronrud 
 det...@uoxray.uoregon.edumailto:det...@uoxray.uoregon.edu wrote:
   If you are refining against F's you have to find some way to avoid
 calculating the square root of a negative number.  That is why people
 have historically rejected negative I's and why Truncate and cTruncate
 were invented.
 
   When refining against I, the calculation of (Iobs - Icalc)^2 couldn't
 care less if Iobs happens to be negative.
 
   As for why people still refine against F...  When I was distributing
 a refinement package it could refine against I but no one wanted to do
 that.  The R values ended up higher, but they were looking at R
 values calculated from F's.  Of course the F based R values are lower
 when you refine against F's, that means nothing.
 
   If we could get the PDB to report both the F and I based R values
 for all models maybe we could get a start toward moving to intensity
 refinement.
 
 Dale Tronrud
 
 
 On 06/20/2013 09:06 AM, Douglas Theobald wrote:
 Just trying to understand the basic issues here.  How could refining directly 
 against intensities solve the fundamental problem of negative intensity 
 values?
 
 
 On Jun 20, 2013, at 11:34 AM, Bernhard Rupp 
 hofkristall...@gmail.commailto:hofkristall...@gmail.com wrote:
 As a maybe better alternative, we should (once again) consider to refine 
 against intensities (and I guess George Sheldrick would agree here).
 
 I have a simple question - what exactly, short of some sort of historic 
 inertia (or memory lapse), is the reason NOT to refine against intensities?
 
 Best, BR
 
 
 
 


Re: [ccp4bb] ctruncate bug?

2013-06-20 Thread Douglas Theobald
How can there be nothing wrong with something that is unphysical?  
Intensities cannot be negative.  How could you measure a negative number of 
photons?  You can only have a Gaussian distribution around I=0 if you are using 
an incorrect, unphysical statistical model.  As I understand it, the physics 
predicts that intensities from diffraction should be gamma distributed (i.e., 
the square of a Gaussian variate), which makes sense as the gamma distribution 
assigns probability 0 to negative values.  
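
As a toy illustration of that last point (this is just the standard acentric Wilson picture; sigma is an arbitrary number I made up, nothing detector-specific):

import numpy as np

rng = np.random.default_rng(0)
sigma = 3.0                                    # per-component SD of the structure factor
F = rng.normal(0, sigma, 10**6) + 1j * rng.normal(0, sigma, 10**6)
I = np.abs(F)**2                               # "true" intensities, all >= 0

print("min(I)     =", I.min())                 # never negative
print("mean(I)    =", I.mean())                # ~ 2*sigma^2 = 18
print("var/mean^2 =", I.var() / I.mean()**2)   # ~ 1, i.e. exponential (gamma) statistics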


On Jun 20, 2013, at 1:00 PM, Bernard D Santarsiero b...@uic.edu wrote:

 There's absolutely nothing wrong with negative intensities. They are 
 measurements of intensities that are near zero, and some will be negative, 
 and others positive.  The distribution around I=0 can still be Gaussian, and 
 you have true esd's.  With F's you used a derived esd since they can't be 
 formally generated from the sigma's on I, and are very much undetermined for 
 small intensities and small F's. 
 
 Small molecule crystallographers routinely refine on F^2 and use all of the 
 data, even if the F^2's are negative.
 
 Bernie
 
 On Jun 20, 2013, at 11:49 AM, Douglas Theobald wrote:
 
 Seems to me that the negative Is should be dealt with early on, in the 
 integration step.  Why exactly do integration programs report negative Is to 
 begin with?
 
 
 On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk wrote:
 
 Wouldn't it be possible to take advantage of negative Is to 
 extrapolate/estimate the decay of scattering background (kind of Wilson 
 plot of background scattering) to flat out the background and push all the 
 Is to positive values?
 
 More of a question rather than a suggestion ...
 
 D
 
 
 
 From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Ian 
 Tickle
 Sent: 20 June 2013 17:34
 To: ccp4bb
 Subject: Re: [ccp4bb] ctruncate bug?
 
 Yes higher R factors is the usual reason people don't like I-based 
 refinement!
 
 Anyway, refining against Is doesn't solve the problem, it only postpones 
 it: you still need the Fs for maps! (though errors in Fs may be less 
 critical then).
 -- Ian
 
 On 20 June 2013 17:20, Dale Tronrud 
 det...@uoxray.uoregon.edumailto:det...@uoxray.uoregon.edu wrote:
 If you are refining against F's you have to find some way to avoid
 calculating the square root of a negative number.  That is why people
 have historically rejected negative I's and why Truncate and cTruncate
 were invented.
 
 When refining against I, the calculation of (Iobs - Icalc)^2 couldn't
 care less if Iobs happens to be negative.
 
 As for why people still refine against F...  When I was distributing
 a refinement package it could refine against I but no one wanted to do
 that.  The R values ended up higher, but they were looking at R
 values calculated from F's.  Of course the F based R values are lower
 when you refine against F's, that means nothing.
 
 If we could get the PDB to report both the F and I based R values
 for all models maybe we could get a start toward moving to intensity
 refinement.
 
 Dale Tronrud
 
 
 On 06/20/2013 09:06 AM, Douglas Theobald wrote:
 Just trying to understand the basic issues here.  How could refining 
 directly against intensities solve the fundamental problem of negative 
 intensity values?
 
 
 On Jun 20, 2013, at 11:34 AM, Bernhard Rupp 
 hofkristall...@gmail.commailto:hofkristall...@gmail.com wrote:
 As a maybe better alternative, we should (once again) consider to refine 
 against intensities (and I guess George Sheldrick would agree here).
 
 I have a simple question - what exactly, short of some sort of historic 
 inertia (or memory lapse), is the reason NOT to refine against intensities?
 
 Best, BR
 
 
 
 
 -- 
 
 


Re: [ccp4bb] ctruncate bug?

2013-06-20 Thread Douglas Theobald
I still don't see how you get a negative intensity from that.  It seems you are 
saying that in many cases of a low intensity reflection, the integrated spot 
will be lower than the background.  That is not equivalent to having a negative 
measurement (as the measurement is actually positive, and sometimes things are 
randomly less positive than background).  If you are using a proper 
statistical model, after background correction you will end up with a positive 
(or 0) value for the integrated intensity.  


On Jun 20, 2013, at 1:08 PM, Andrew Leslie and...@mrc-lmb.cam.ac.uk wrote:

 
 The integration programs report a negative intensity simply because that is 
 the observation. 
 
 Because of noise in the Xray background, in a large sample of intensity 
 estimates for reflections whose true intensity is very very small one will 
 inevitably get some measurements that are negative. These must not be 
 rejected because this will lead to bias (because some of these intensities 
 for symmetry mates will be estimated too large rather than too small). It is 
 not unusual for the intensity to remain negative even after averaging 
 symmetry mates.
 
 Andrew
 
 
 On 20 Jun 2013, at 11:49, Douglas Theobald dtheob...@brandeis.edu wrote:
 
 Seems to me that the negative Is should be dealt with early on, in the 
 integration step.  Why exactly do integration programs report negative Is to 
 begin with?
 
 
 On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk wrote:
 
 Wouldn't it be possible to take advantage of negative Is to 
 extrapolate/estimate the decay of the scattering background (a kind of Wilson 
 plot of background scattering) to flatten out the background and push all the 
 Is to positive values?
 
 More of a question rather than a suggestion ...
 
 D
 
 
 
 From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Ian 
 Tickle
 Sent: 20 June 2013 17:34
 To: ccp4bb
 Subject: Re: [ccp4bb] ctruncate bug?
 
 Yes higher R factors is the usual reason people don't like I-based 
 refinement!
 
 Anyway, refining against Is doesn't solve the problem, it only postpones 
 it: you still need the Fs for maps! (though errors in Fs may be less 
 critical then).
 -- Ian
 
 On 20 June 2013 17:20, Dale Tronrud 
 det...@uoxray.uoregon.edumailto:det...@uoxray.uoregon.edu wrote:
 If you are refining against F's you have to find some way to avoid
 calculating the square root of a negative number.  That is why people
 have historically rejected negative I's and why Truncate and cTruncate
 were invented.
 
 When refining against I, the calculation of (Iobs - Icalc)^2 couldn't
 care less if Iobs happens to be negative.
 
 As for why people still refine against F...  When I was distributing
 a refinement package it could refine against I but no one wanted to do
 that.  The R values ended up higher, but they were looking at R
 values calculated from F's.  Of course the F based R values are lower
 when you refine against F's, that means nothing.
 
 If we could get the PDB to report both the F and I based R values
 for all models maybe we could get a start toward moving to intensity
 refinement.
 
 Dale Tronrud
 
 
 On 06/20/2013 09:06 AM, Douglas Theobald wrote:
 Just trying to understand the basic issues here.  How could refining 
 directly against intensities solve the fundamental problem of negative 
 intensity values?
 
 
 On Jun 20, 2013, at 11:34 AM, Bernhard Rupp 
 hofkristall...@gmail.commailto:hofkristall...@gmail.com wrote:
 As a maybe better alternative, we should (once again) consider to refine 
 against intensities (and I guess George Sheldrick would agree here).
 
 I have a simple question - what exactly, short of some sort of historic 
 inertia (or memory lapse), is the reason NOT to refine against intensities?
 
 Best, BR
 
 
 
 
 -- 
 
 


Re: [ccp4bb] ctruncate bug?

2013-06-20 Thread Douglas Theobald
On Jun 20, 2013, at 1:47 PM, Felix Frolow mbfro...@post.tau.ac.il wrote:

 Intensity is subtraction:  Inet=Iobs - Ibackground.  Iobs and Ibackground can 
 not be negative.  Inet CAN be negative if background is higher than Iobs. 

Just to reiterate, we know that the true value of Inet cannot be negative.  
Hence, the equation you quote is invalid and illogical --- it has no physical 
or statistical justification (except as an approximation for large Iobs and low 
Iback, when ironically background correction is unnecessary).  That equation 
does not account for random statistical fluctuations (e.g., simple Poisson 
counting statistics of shot noise).  
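
To make that concrete, here is a minimal Poisson simulation (Python with
numpy; the photon counts are made-up numbers, not real data).  Even though the
true spot intensity is non-negative, the naive subtraction estimate comes out
negative a large fraction of the time for weak reflections:

    import numpy as np

    rng = np.random.default_rng(1)

    true_I = 2.0       # weak but non-negative true spot intensity (made up)
    background = 50.0  # mean background photons under the spot (made up)
    n = 100_000

    spot = rng.poisson(true_I + background, size=n)  # counts in the spot region
    bkg = rng.poisson(background, size=n)            # estimate of the background

    I_net = spot - bkg  # the conventional subtraction "measurement"

    print((I_net < 0).mean())  # a sizeable fraction is negative, yet true_I >= 0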


 We do not know how to model background scattering modulated my molecular 
 transform and mechanical motion of the molecule, 
 I recall we have called it TDS - thermal diffuse scattering. Many years ago 
 Boaz Shaanan and JH were fascinated by it.
 If we would know how deal with TDS, we would go to much nicer structures some 
 of us like and for sure to much lower 
 R factors all of us love excluding maybe referees who will claim over 
 refinement :-\
 Dr Felix Frolow   
 Professor of Structural Biology and Biotechnology, 
 Department of Molecular Microbiology and Biotechnology
 Tel Aviv University 69978, Israel
 
 Acta Crystallographica F, co-editor
 
 e-mail: mbfro...@post.tau.ac.il
 Tel:  ++972-3640-8723
 Fax: ++972-3640-9407
 Cellular: 0547 459 608
 
 On Jun 20, 2013, at 20:07 , Douglas Theobald dtheob...@brandeis.edu wrote:
 
 How can there be nothing wrong with something that is unphysical?  
 Intensities cannot be negative.  How could you measure a negative number of 
 photons?  You can only have a Gaussian distribution around I=0 if you are 
 using an incorrect, unphysical statistical model.  As I understand it, the 
 physics predicts that intensities from diffraction should be gamma 
 distributed (i.e., the square of a Gaussian variate), which makes sense as 
 the gamma distribution assigns probability 0 to negative values.  
 
 
 On Jun 20, 2013, at 1:00 PM, Bernard D Santarsiero b...@uic.edu wrote:
 
 There's absolutely nothing wrong with negative intensities. They are 
 measurements of intensities that are near zero, and some will be negative, 
 and others positive.  The distribution around I=0 can still be Gaussian, 
 and you have true esd's.  With F's you used a derived esd since they can't 
 be formally generated from the sigma's on I, and are very much undetermined 
 for small intensities and small F's. 
 
 Small molecule crystallographers routinely refine on F^2 and use all of the 
 data, even if the F^2's are negative.
 
 Bernie
 
 On Jun 20, 2013, at 11:49 AM, Douglas Theobald wrote:
 
 Seems to me that the negative Is should be dealt with early on, in the 
 integration step.  Why exactly do integration programs report negative Is 
 to begin with?
 
 
 On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk 
 wrote:
 
 Wouldn't it be possible to take advantage of negative Is to 
 extrapolate/estimate the decay of the scattering background (a kind of Wilson 
 plot of background scattering) to flatten out the background and push all 
 the Is to positive values?
 
 More of a question rather than a suggestion ...
 
 D
 
 
 
 From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Ian 
 Tickle
 Sent: 20 June 2013 17:34
 To: ccp4bb
 Subject: Re: [ccp4bb] ctruncate bug?
 
 Yes higher R factors is the usual reason people don't like I-based 
 refinement!
 
 Anyway, refining against Is doesn't solve the problem, it only postpones 
 it: you still need the Fs for maps! (though errors in Fs may be less 
 critical then).
 -- Ian
 
 On 20 June 2013 17:20, Dale Tronrud 
 det...@uoxray.uoregon.edumailto:det...@uoxray.uoregon.edu wrote:
 If you are refining against F's you have to find some way to avoid
 calculating the square root of a negative number.  That is why people
 have historically rejected negative I's and why Truncate and cTruncate
 were invented.
 
 When refining against I, the calculation of (Iobs - Icalc)^2 couldn't
 care less if Iobs happens to be negative.
 
 As for why people still refine against F...  When I was distributing
 a refinement package it could refine against I but no one wanted to do
 that.  The R values ended up higher, but they were looking at R
 values calculated from F's.  Of course the F based R values are lower
 when you refine against F's, that means nothing.
 
 If we could get the PDB to report both the F and I based R values
 for all models maybe we could get a start toward moving to intensity
 refinement.
 
 Dale Tronrud
 
 
 On 06/20/2013 09:06 AM, Douglas Theobald wrote:
 Just trying to understand the basic issues here.  How could refining 
 directly against intensities solve the fundamental problem of negative 
 intensity values?
 
 
 On Jun 20, 2013, at 11:34 AM, Bernhard Rupp 
 hofkristall...@gmail.commailto:hofkristall...@gmail.com wrote:
 As a maybe better alternative, we should (once again

Re: [ccp4bb] ctruncate bug?

2013-06-20 Thread Douglas Theobald
Kay, I understand the French-Wilson way of currently doing things, as you 
outline below.  My point is that it is not optimal --- we could do things 
better --- since even French-Wilson accepts the idea of negative intensity 
measurements.  I am trying to disabuse the (very stubborn) view that when the 
background is more than the spot, the only possible estimate of the intensity 
is a negative value.  This is untrue, and unjustified by the physics involved.  
In principle, there is no reason to use French-Wilson, as we should never have 
reported a negative integrated intensity to begin with.  

I also understand that (Iobs-Icalc)^2 is not the actual refinement target, but 
the same point applies, and the actual target is based on a fundamental 
Gaussian assumption for the Is.  


On Jun 20, 2013, at 2:13 PM, Kay Diederichs kay.diederi...@uni-konstanz.de 
wrote:

 Douglas,
 
 the intensity is negative if the integrated spot has a lower intensity than 
 the estimate of the background under the spot. So yes, we are not _measuring_ 
 negative intensities, rather we are estimating intensities, and that estimate 
 may turn out to be negative. In a later step we try to correct for this, 
 because it is non-physical, as you say. At that point, the proper 
 statistical model comes into play. Essentially we use this as a prior. In 
 the order of increasing information, we can have more or less informative 
 priors for weak reflections:
 1) I > 0
 2) I has a distribution looking like the right half of a Gaussian, and we 
 estimate its width from the variance of the intensities in a resolution shell
 3) I follows a Wilson distribution, and we estimate its parameters from the 
 data in a resolution shell
 4) I must be related to Fcalc^2 (i.e. once the structure is solved, we 
 re-integrate using the Fcalc as prior)
 For a given experiment, the problem is chicken-and-egg in the sense that only 
 if you know the characteristics of the data can you choose the correct prior.
 I guess that using prior 4) would be heavily frowned upon because there is a 
 danger of model bias. You could say: A Bayesian analysis done properly should 
 not suffer from model bias. This is probably true, but the theory to ensure 
 the word properly is not available at the moment.
 Crystallographers usually use prior 3) which, as I tried to point out, also 
 has its weak spots, namely if the data do not behave like those of an ideal 
 crystal - and today's projects often result in data that would have been 
 discarded ten years ago, so they are far from ideal.
 Prior 2) is available as an option in XDSCONV
 Prior 1) seems to be used, or is available, in ctruncate in certain cases (I 
 don't know the details)
 
 Using intensities instead of amplitudes in refinement would avoid having to 
 choose a prior, and refinement would therefore not be compromised in case of 
 data violating the assumptions underlying the prior. 
 
 By the way, it is not (Iobs-Icalc)^2 that would be optimized in refinement 
 against intensities, but rather the corresponding maximum likelihood formula 
 (which I seem to remember is more complicated than the amplitude ML formula, 
 or is not an analytical formula at all, but maybe somebody knows better).
 
 best,
 
 Kay
 
 
 On Thu, 20 Jun 2013 13:14:28 -0400, Douglas Theobald dtheob...@brandeis.edu 
 wrote:
 
 I still don't see how you get a negative intensity from that.  It seems you 
 are saying that in many cases of a low intensity reflection, the integrated 
 spot will be lower than the background.  That is not equivalent to having a 
 negative measurement (as the measurement is actually positive, and sometimes 
 things are randomly less positive than background).  If you are using a 
 proper statistical model, after background correction you will end up with a 
 positive (or 0) value for the integrated intensity.  
 
 
 On Jun 20, 2013, at 1:08 PM, Andrew Leslie and...@mrc-lmb.cam.ac.uk wrote:
 
 
 The integration programs report a negative intensity simply because that is 
 the observation. 
 
 Because of noise in the Xray background, in a large sample of intensity 
 estimates for reflections whose true intensity is very very small one will 
 inevitably get some measurements that are negative. These must not be 
 rejected because this will lead to bias (because some of these intensities 
 for symmetry mates will be estimated too large rather than too small). It 
 is not unusual for the intensity to remain negative even after averaging 
 symmetry mates.
 
 Andrew
 
 
 On 20 Jun 2013, at 11:49, Douglas Theobald dtheob...@brandeis.edu wrote:
 
 Seems to me that the negative Is should be dealt with early on, in the 
 integration step.  Why exactly do integration programs report negative Is 
 to begin with?
 
 
 On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk 
 wrote:
 
 Wouldn't it be possible to take advantage of negative Is to 
 extrapolate/estimate the decay of the scattering background (a kind of Wilson

Re: [ccp4bb] ctruncate bug?

2013-06-20 Thread Douglas Theobald
Well, I tend to think Ian is probably right, that doing things the proper way 
(vs French-Wilson) will not make much of a difference in the end.  

Nevertheless, I don't think refining against the (possibly negative) 
intensities is a good solution to dealing with negative intensities --- that 
just ignores the problem, and will end up overweighting large negative 
intensities.  Wouldn't it be better to correct the negative intensities with FW 
and then refine against that?
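
For what it's worth, here is a deliberately simplified sketch of the kind of
correction I have in mind (Python with scipy; it assumes a flat prior on
I >= 0 and a Gaussian measurement error, which is only a stand-in for the
Wilson prior that French and Wilson actually use).  The posterior mean is
always positive, and strong reflections are left essentially unchanged:

    from scipy.stats import norm

    def posterior_mean_I(i_obs, sigma):
        # Posterior mean of the true intensity given I >= 0 (flat prior) and a
        # Gaussian measurement i_obs with standard deviation sigma; the
        # posterior is a normal distribution truncated at zero.
        t = i_obs / sigma
        return i_obs + sigma * norm.pdf(t) / norm.cdf(t)

    print(posterior_mean_I(-3.0, 2.0))  # ~0.88: small and positive, not -3
    print(posterior_mean_I(10.0, 2.0))  # ~10.0: strong data barely changed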


On Jun 20, 2013, at 3:38 PM, Kay Diederichs kay.diederi...@uni-konstanz.de 
wrote:

 Douglas,
 
 as soon as you come up with an algorithm that gives accurate, unbiased 
 intensity estimates together with their standard deviations, everybody will 
 be happy. But I'm not aware of progress in this question (Poisson signal with 
 background) in the last decades - I'd be glad to be proven wrong!
 
 Kay
 
 Am 20.06.13 21:27, schrieb Douglas Theobald:
 Kay, I understand the French-Wilson way of currently doing things, as you 
 outline below.  My point is that it is not optimal --- we could do things 
 better --- since even French-Wilson accepts the idea of negative intensity 
 measurements.  I am trying to disabuse the (very stubborn) view that when 
 the background is more than the spot, the only possible estimate of the 
 intensity is a negative value.  This is untrue, and unjustified by the 
 physics involved.  In principle, there is no reason to use French-Wilson, as 
 we should never have reported a negative integrated intensity to begin with.
 
 I also understand that (Iobs-Icalc)^2 is not the actual refinement target, 
 but the same point applies, and the actual target is based on a fundamental 
 Gaussian assumption for the Is.
 
 
 On Jun 20, 2013, at 2:13 PM, Kay Diederichs kay.diederi...@uni-konstanz.de 
 wrote:
 
 Douglas,
 
 the intensity is negative if the integrated spot has a lower intensity than 
 the estimate of the background under the spot. So yes, we are not 
 _measuring_ negative intensities, rather we are estimating intensities, and 
 that estimate may turn out to be negative. In a later step we try to 
 correct for this, because it is non-physical, as you say. At that point, 
 the proper statistical model comes into play. Essentially we use this as 
 a prior. In the order of increasing information, we can have more or less 
 informative priors for weak reflections:
 1) I > 0
 2) I has a distribution looking like the right half of a Gaussian, and we 
 estimate its width from the variance of the intensities in a resolution 
 shell
 3) I follows a Wilson distribution, and we estimate its parameters from the 
 data in a resolution shell
 4) I must be related to Fcalc^2 (i.e. once the structure is solved, we 
 re-integrate using the Fcalc as prior)
 For a given experiment, the problem is chicken-and-egg in the sense that 
 only if you know the characteristics of the data can you choose the correct 
 prior.
 I guess that using prior 4) would be heavily frowned upon because there is 
 a danger of model bias. You could say: A Bayesian analysis done properly 
 should not suffer from model bias. This is probably true, but the theory to 
 ensure the word properly is not available at the moment.
 Crystallographers usually use prior 3) which, as I tried to point out, also 
 has its weak spots, namely if the data do not behave like those of an ideal 
 crystal - and today's projects often result in data that would have been 
 discarded ten years ago, so they are far from ideal.
 Prior 2) is available as an option in XDSCONV
 Prior 1) seems to be used, or is available, in ctruncate in certain cases 
 (I don't know the details)
 
 Using intensities instead of amplitudes in refinement would avoid having to 
 choose a prior, and refinement would therefore not be compromised in case 
 of data violating the assumptions underlying the prior.
 
 By the way, it is not (Iobs-Icalc)^2 that would be optimized in refinement 
 against intensities, but rather the corresponding maximum likelihood 
 formula (which I seem to remember is more complicated than the amplitude ML 
 formula, or is not an analytical formula at all, but maybe somebody knows 
 better).
 
 best,
 
 Kay
 
 
 On Thu, 20 Jun 2013 13:14:28 -0400, Douglas Theobald 
 dtheob...@brandeis.edu wrote:
 
 I still don't see how you get a negative intensity from that.  It seems 
 you are saying that in many cases of a low intensity reflection, the 
 integrated spot will be lower than the background.  That is not equivalent 
 to having a negative measurement (as the measurement is actually positive, 
 and sometimes things are randomly less positive than background).  If you 
 are using a proper statistical model, after background correction you will 
 end up with a positive (or 0) value for the integrated intensity.
 
 
 On Jun 20, 2013, at 1:08 PM, Andrew Leslie and...@mrc-lmb.cam.ac.uk 
 wrote:
 
 
 The integration programs report a negative intensity simply because

Re: [ccp4bb] Strand distorsion and residue disconnectivity in pymol

2013-05-30 Thread Douglas Theobald
To me, that's not a problem.  The wavy representation is more accurate (as far 
as cartoon accuracy can go), as the strand actually follows the alpha 
carbons.  This is why Pauling called it a pleated sheet --- it's got pleats.  
Beta sheets/strands *should* be wavy.  


On May 29, 2013, at 11:29 PM, wu donghui wdh0...@gmail.com wrote:

 Dear all,
  
 I found a problem when I use pymol to prepare structure interface. Strand is 
 distorted when residue from the strand is connected to the strand by turning 
 on side_chain_helper on. However when side_chain_helper is off, the strand 
 turns to normal shape but the residue from it is disconnected to the strand. 
 I attached the picture for your help. I know there must be some tricks for 
 this. Welcome for any input. Thanks a lot.
  
 Best,
  
 Donghui
 Distorsion and connectivity in pymol for strand.pdf


Re: [ccp4bb] how to update phenix

2013-02-11 Thread Douglas Theobald
On Mon, Feb 11, 2013 at 12:12 PM, Tim Gruene t...@shelx.uni-ac.gwdg.de wrote:

 Dear Bill,

 I disagree to your criticism. From http://www.ccp4.ac.uk/ccp4bb.php:
 CCP4bb is an electronic mailing list intended to host discussions
 about topics of general interest to macromolecular
 crystallographers.[...]

 Personally I am only subscribed to three mailing lists and I refrain
 from subscribing to more which is one of the reasons why I welcome the
 liberal topic description of the ccp4bb.

So the more appropriate analogy would be asking my mistress what to
get my wife for V-day.


 Cheers,
 Tim

 On 02/10/2013 06:20 PM, William G. Scott wrote:
 On Feb 10, 2013, at 8:23 AM, LISA science...@gmail.com wrote:

 Hi all, My mac has the old version of phenix. How can I update to
 the new version? Should I delete the old version and download the
 new version to install as the first time? Thanks

 lisa


 You can delete it and download a new version, or simply keep both.
 phenix has version labels on their binaries, for the enjoyment of
 those who use shell auto-completion. e.g.:

 fennario-% phenix.refine  external command phenix.refine
 phenix.refine_1.8.1-1168



 BTW, there is also a phenix bb.

 Asking about this here is kind of like asking my wife what I should
 get my (purely hypothetical) mistress for valentine's day.

 - --
 - --
 Dr Tim Gruene
 Institut fuer anorganische Chemie
 Tammannstr. 4
 D-37077 Goettingen

 GPG Key ID = A46BEE1A



Re: [ccp4bb] refining against weak data and Table I stats

2012-12-13 Thread Douglas Theobald
On Dec 13, 2012, at 1:52 AM, James Holton jmhol...@lbl.gov wrote:

[snip]

 So, what I would advise is to refine your model with data out to the 
 resolution limit defined by CC*, but declare the resolution of the 
 structure to be where the merged I/sigma(I) falls to 2. You might even want 
 to calculate your Rmerge, Rcryst, Rfree and all the other R values to this 
 resolution as well, since including a lot of zeroes does nothing but 
 artificially drive up estimates of relative error.  

So James --- it appears that you basically agree with my proposal?  I.e., 

(1) include all of the data in refinement (at least up to where CC1/2 or CC* is 
still significant)

(2) keep the definition of resolution to what is more-or-less the de facto 
standard (res bin where I/sigI=2), 

(3) report Table I where everything is calculated up to this resolution (where 
I/sigI=2), and 

(4) maybe include in Supp Mat an additional table that reports statistics for 
all the data (I'm leaning towards a table with stats for each res bin)

As you argued, and as I argued, this seems to be a good compromise, one that 
modifies current practice to include weak data, but nevertheless does not 
change the def of resolution or the Table I stats, so that we can still compare 
with legacy structures/stats.


 Perhaps we should even take a lesson from our small molecule friends and 
 start reporting R1, where the R factor is computed only for hkls where 
 I/sigma(I) is above 3?
 
 -James Holton
 MAD Scientist
 
 On 12/8/2012 4:04 AM, Miller, Mitchell D. wrote:
 I too like the idea of reporting the table 1 stats vs resolution
 rather than just the overall values and highest resolution shell.
 
 I also wanted to point out an earlier thread from April about the
 limitations of the PDB's defining the resolution as being that of
 the highest resolution reflection (even if data is incomplete or weak).
 https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1204L=ccp4bbD=01=ccp4bb9=AI=-3J=ond=No+Match%3BMatch%3BMatchesz=4P=376289
 https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1204L=ccp4bbD=01=ccp4bb9=AI=-3J=ond=No+Match%3BMatch%3BMatchesz=4P=377673
 
 What we have done in the past for cases of low completeness
 in the outer shell is to define the nominal resolution ala Bart
 Hazes' method of same number of reflections as a complete data set and
 use this in the PDB title and describe it in the remark 3 other
 refinement remarks.
   There is also the possibility of adding a comment to the PDB
 remark 2 which we have not used.
 http://www.wwpdb.org/documentation/format33/remarks1.html#REMARK%202
 This should help convince reviewers that you are not trying
 to mis-represent the resolution of the structure.
 
 
 Regards,
 Mitch
 
 -Original Message-
 From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Edward 
 A. Berry
 Sent: Friday, December 07, 2012 8:43 AM
 To: CCP4BB@JISCMAIL.AC.UK
 Subject: Re: [ccp4bb] refining against weak data and Table I stats
 
 Yes, well, actually i'm only a middle author on that paper for a good
 reason, but I did encourage Rebecca and Stephan to use all the data.
 But on a later, much more modest submission, where the outer shell
 was not only weak but very incomplete (edges of the detector),
 the reviewers found it difficult to evaluate the quality
 of the data (we had also excluded a zone with bad ice-ring
 problems). So we provided a second table, cutting off above
 the ice ring in the good strong data, which convinced them
 that at least it is a decent 2A structure. In the PDB it is
 a 1.6A structure. but there was a lot of good data between
 the ice ring and 1.6 A.
 
 Bart Hazes (I think) suggested a statistic called effective
 resolution, which is the resolution at which a complete dataset
 would have the same number of reflections as your dataset, and we
 reported this, which came out to something like 1.75.
 
 I do like the idea of reporting in multiple shells, not just overall
 and highest shell, and the PDB accommodates this, and even has a GUI
 to enter it in the ADIT 2.0 software. It could also be used to
 report two different overall ranges, such as completeness, 25 to 1.6 A,
 which would be shocking in my case, and 25 to 2.0 which would
 be more reassuring.
 
 eab
 
 Douglas Theobald wrote:
 Hi Ed,
 
 Thanks for the comments.  So what do you recommend?  Refine against weak 
 data, and report all stats in a single Table I?
 
 Looking at your latest V-ATPase structure paper, it appears you favor 
 something like that, since you report a high res shell with I/sigI=1.34 and 
 Rsym=1.65.
 
 
 On Dec 6, 2012, at 7:24 PM, Edward A. Berryber...@upstate.edu  wrote:
 
 Another consideration here is your PDB deposition. If the reason for using
 weak data is to get a better structure, presumably you are going to deposit
 the structure using all the data. Then the statistics in the PDB file must
 reflect the high resolution refinement.
 
 There are I think three places in the PDB file where the resolution

Re: [ccp4bb] refining against weak data and Table I stats

2012-12-07 Thread Douglas Theobald
Hi Ed,

Thanks for the comments.  So what do you recommend?  Refine against weak data, 
and report all stats in a single Table I?

Looking at your latest V-ATPase structure paper, it appears you favor something 
like that, since you report a high res shell with I/sigI=1.34 and Rsym=1.65.  


On Dec 6, 2012, at 7:24 PM, Edward A. Berry ber...@upstate.edu wrote:

 Another consideration here is your PDB deposition. If the reason for using
 weak data is to get a better structure, presumably you are going to deposit
 the structure using all the data. Then the statistics in the PDB file must
 reflect the high resolution refinement.
 
 There are I think three places in the PDB file where the resolution is stated,
 but i believe they are all required to be the same and to be equal to the
 highest resolution data used (even if there were only two reflections in that 
 shell).
 Rmerge or Rsymm must be reported, and until recently I think they were not 
 allowed
 to exceed 1.00 (100% error?).
 
 What are your reviewers going to think if the title of your paper is
 structure of protein A at 2.1 A resolution but they check the PDB file
 and the resolution was really 1.9 A?  And Rsymm in the PDB is 0.99 but
 in your table 1* says 1.3?
 
 Douglas Theobald wrote:
 Hello all,
 
 I've followed with interest the discussions here about how we should be 
 refining against weak data, e.g. data with I/sigI < 2 (perhaps using all 
 bins that have a significant CC1/2 per Karplus and Diederichs 2012).  This 
 all makes statistical sense to me, but now I am wondering how I should 
 report data and model stats in Table I.
 
 Here's what I've come up with: report two Table I's.  For comparability to 
 legacy structure stats, report a classic Table I, where I call the 
 resolution whatever bin I/sigI=2.  Use that as my high res bin, with high 
 res bin stats reported in parentheses after global stats.   Then have 
 another Table (maybe Table I* in supplementary material?) where I report 
 stats for the whole dataset, including the weak data I used in refinement.  
 In both tables report CC1/2 and Rmeas.
 
 This way, I don't redefine the (mostly) conventional usage of resolution, 
 my Table I can be compared to precedent, I report stats for all the data and 
 for the model against all data, and I take advantage of the information in 
 the weak data during refinement.
 
 Thoughts?
 
 Douglas
 
 
 ^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`
 Douglas L. Theobald
 Assistant Professor
 Department of Biochemistry
 Brandeis University
 Waltham, MA  02454-9110
 
 dtheob...@brandeis.edu
 http://theobald.brandeis.edu/
 
 ^\
   /`  /^.  / /\
  / / /`/  / . /`
 / /  '   '
 '
 
 
 


Re: [ccp4bb] refining against weak data and Table I stats

2012-12-07 Thread Douglas Theobald
Hi Boaz,

I read the KK paper as primarily a justification for including extremely weak 
data in refinement (and of course introducing a new single statistic that can 
judge data *and* model quality comparably).  Using CC1/2 to gauge resolution 
seems like a good option, but I never got from the paper exactly how to do 
that.  The resolution bin where CC1/2=0.5 seems natural, but in my (limited) 
experience that gives almost the same answer as I/sigI=2 (see also KK fig 3).



On Dec 7, 2012, at 6:21 AM, Boaz Shaanan bshaa...@exchange.bgu.ac.il wrote:

 Hi,
 
 I'm sure Kay will have something to say about this but I think the idea of 
 the K & K paper was to introduce new (more objective) standards for deciding 
 on the resolution, so I don't see why another table is needed.
 
 Cheers,
 
 
 
 
   Boaz
 
 
 Boaz Shaanan, Ph.D.
 Dept. of Life Sciences
 Ben-Gurion University of the Negev
 Beer-Sheva 84105
 Israel
 
 E-mail: bshaa...@bgu.ac.il
 Phone: 972-8-647-2220  Skype: boaz.shaanan
 Fax:   972-8-647-2992 or 972-8-646-1710
 
 
 
 
 
 
 From: CCP4 bulletin board [CCP4BB@JISCMAIL.AC.UK] on behalf of Douglas 
 Theobald [dtheob...@brandeis.edu]
 Sent: Friday, December 07, 2012 1:05 AM
 To: CCP4BB@JISCMAIL.AC.UK
 Subject: [ccp4bb] refining against weak data and Table I stats
 
 Hello all,
 
 I've followed with interest the discussions here about how we should be 
 refining against weak data, e.g. data with I/sigI < 2 (perhaps using all 
 bins that have a significant CC1/2 per Karplus and Diederichs 2012).  This 
 all makes statistical sense to me, but now I am wondering how I should report 
 data and model stats in Table I.
 
 Here's what I've come up with: report two Table I's.  For comparability to 
 legacy structure stats, report a classic Table I, where I call the 
 resolution whatever bin I/sigI=2.  Use that as my high res bin, with high 
 res bin stats reported in parentheses after global stats.   Then have another 
 Table (maybe Table I* in supplementary material?) where I report stats for 
 the whole dataset, including the weak data I used in refinement.  In both 
 tables report CC1/2 and Rmeas.
 
 This way, I don't redefine the (mostly) conventional usage of resolution, 
 my Table I can be compared to precedent, I report stats for all the data and 
 for the model against all data, and I take advantage of the information in 
 the weak data during refinement.
 
 Thoughts?
 
 Douglas
 
 
 ^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`
 Douglas L. Theobald
 Assistant Professor
 Department of Biochemistry
 Brandeis University
 Waltham, MA  02454-9110
 
 dtheob...@brandeis.edu
 http://theobald.brandeis.edu/
 
^\
  /`  /^.  / /\
 / / /`/  / . /`
 / /  '   '
 '
 


Re: [ccp4bb] refining against weak data and Table I stats

2012-12-07 Thread Douglas Theobald
A good way to think about it is that if CC1/2=100%, that means you can split 
the data in half, and use one half to perfectly predict the corresponding 
values of the other half. So yes, perfect internal consistency.
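
A toy calculation shows the idea (Python with numpy; the intensities are
simulated, not real data).  Two half-dataset estimates that share the same
underlying signal but carry independent noise correlate almost perfectly:

    import numpy as np

    rng = np.random.default_rng(2)

    true_I = rng.gamma(shape=1.0, scale=100.0, size=5000)     # simulated signal
    half1 = true_I + rng.normal(scale=5.0, size=true_I.size)  # half-dataset 1
    half2 = true_I + rng.normal(scale=5.0, size=true_I.size)  # half-dataset 2

    cc_half = np.corrcoef(half1, half2)[0, 1]  # Pearson CC between the halves
    print(cc_half)  # close to 1: one half predicts the other almost perfectly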


On Dec 7, 2012, at 11:41 AM, Phil Evans p...@mrc-lmb.cam.ac.uk wrote:

 It is internally consistent, though not necessarily correct
 
 
 On 7 Dec 2012, at 16:23, Alan Cheung wrote:
 
 Related to this, I've always wondered what CC1/2 values mean for low 
 resolution. Not being mathematically inclined, I'm sure this is a naive 
 question, but i'll ask anyway - what does CC1/2=100 (or 99.9) mean? Does it 
 mean the data is as good as it gets?
 
 Alan
 
 
 
 On 07/12/2012 17:15, Douglas Theobald wrote:
 Hi Boaz,
 
 I read the KK paper as primarily a justification for including extremely 
 weak data in refinement (and of course introducing a new single statistic 
 that can judge data *and* model quality comparably).  Using CC1/2 to gauge 
 resolution seems like a good option, but I never got from the paper exactly 
 how to do that.  The resolution bin where CC1/2=0.5 seems natural, but in 
 my (limited) experience that gives almost the same answer as I/sigI=2 (see 
 also KK fig 3).
 
 
 
 On Dec 7, 2012, at 6:21 AM, Boaz Shaanan bshaa...@exchange.bgu.ac.il 
 wrote:
 
 Hi,
 
 I'm sure Kay will have something to say about this but I think the idea 
 of the K & K paper was to introduce new (more objective) standards for 
 deciding on the resolution, so I don't see why another table is needed.
 
 Cheers,
 
 
 
 
  Boaz
 
 
 Boaz Shaanan, Ph.D.
 Dept. of Life Sciences
 Ben-Gurion University of the Negev
 Beer-Sheva 84105
 Israel
 
 E-mail: bshaa...@bgu.ac.il
 Phone: 972-8-647-2220  Skype: boaz.shaanan
 Fax:   972-8-647-2992 or 972-8-646-1710
 
 
 
 
 
 
 From: CCP4 bulletin board [CCP4BB@JISCMAIL.AC.UK] on behalf of Douglas 
 Theobald [dtheob...@brandeis.edu]
 Sent: Friday, December 07, 2012 1:05 AM
 To: CCP4BB@JISCMAIL.AC.UK
 Subject: [ccp4bb] refining against weak data and Table I stats
 
 Hello all,
 
 I've followed with interest the discussions here about how we should be 
 refining against weak data, e.g. data with I/sigI < 2 (perhaps using all 
 bins that have a significant CC1/2 per Karplus and Diederichs 2012).  
 This all makes statistical sense to me, but now I am wondering how I 
 should report data and model stats in Table I.
 
 Here's what I've come up with: report two Table I's.  For comparability to 
 legacy structure stats, report a classic Table I, where I call the 
 resolution whatever bin I/sigI=2.  Use that as my high res bin, with 
 high res bin stats reported in parentheses after global stats.   Then have 
 another Table (maybe Table I* in supplementary material?) where I report 
 stats for the whole dataset, including the weak data I used in refinement. 
  In both tables report CC1/2 and Rmeas.
 
 This way, I don't redefine the (mostly) conventional usage of 
 resolution, my Table I can be compared to precedent, I report stats for 
 all the data and for the model against all data, and I take advantage of 
 the information in the weak data during refinement.
 
 Thoughts?
 
 Douglas
 
 
 ^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`
 Douglas L. Theobald
 Assistant Professor
 Department of Biochemistry
 Brandeis University
 Waltham, MA  02454-9110
 
 dtheob...@brandeis.edu
 http://theobald.brandeis.edu/
 
   ^\
 /`  /^.  / /\
 / / /`/  / . /`
 / /  '   '
 '
 
 
 
 
 -- 
 Alan Cheung
 Gene Center
 Ludwig-Maximilians-University
 Feodor-Lynen-Str. 25
 81377 Munich
 Germany
 Phone:  +49-89-2180-76845
 Fax:  +49-89-2180-76999
 E-mail: che...@lmb.uni-muenchen.de


[ccp4bb] refining against weak data and Table I stats

2012-12-06 Thread Douglas Theobald
Hello all,

I've followed with interest the discussions here about how we should be 
refining against weak data, e.g. data with I/sigI < 2 (perhaps using all bins 
that have a significant CC1/2 per Karplus and Diederichs 2012).  This all 
makes statistical sense to me, but now I am wondering how I should report data 
and model stats in Table I.  

Here's what I've come up with: report two Table I's.  For comparability to 
legacy structure stats, report a classic Table I, where I call the resolution 
whatever bin I/sigI=2.  Use that as my high res bin, with high res bin stats 
reported in parentheses after global stats.   Then have another Table (maybe 
Table I* in supplementary material?) where I report stats for the whole 
dataset, including the weak data I used in refinement.  In both tables report 
CC1/2 and Rmeas.  

This way, I don't redefine the (mostly) conventional usage of resolution, my 
Table I can be compared to precedent, I report stats for all the data and for 
the model against all data, and I take advantage of the information in the weak 
data during refinement. 

Thoughts?

Douglas


^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`
Douglas L. Theobald
Assistant Professor
Department of Biochemistry
Brandeis University
Waltham, MA  02454-9110

dtheob...@brandeis.edu
http://theobald.brandeis.edu/

^\
  /`  /^.  / /\
 / / /`/  / . /`
/ /  '   '
'

 

Re: [ccp4bb] vitrification vs freezing

2012-11-16 Thread Douglas Theobald
On Nov 16, 2012, at 10:27 AM, Enrico Stura est...@cea.fr wrote:

 As a referee I also dislike the word freezing but only if improperly used:
 The crystals were frozen in LN2 is not acceptable because it is the outside
 liquor that is rapidly cooled to cryogenic temperatures.

right, while the crystals within the liquor remain at room temperature :)

 
 But the use of freezing as the opposite of melting is fine and does not
 imply a crystalline state. Ice is not always crystalline either:
 http://en.wikipedia.org/wiki/Amorphous_ice
 
 
 -- 
 Enrico A. Stura D.Phil. (Oxon) ,Tel: 33 (0)1 69 08 4302 Office
 Room 19, Bat.152,   Tel: 33 (0)1 69 08 9449Lab
 LTMB, SIMOPRO, IBiTec-S, CE Saclay, 91191 Gif-sur-Yvette,   FRANCE
 http://www-dsv.cea.fr/en/institutes/institute-of-biology-and-technology-saclay-ibitec-s/unites-de-recherche/department-of-molecular-engineering-of-proteins-simopro/molecular-toxinology-and-biotechnology-laboratory-ltmb/crystallogenesis-e.-stura
 http://www.chem.gla.ac.uk/protein/mirror/stura/index2.html
 e-mail: est...@cea.fr Fax: 33 (0)1 69 08 90 71


Re: [ccp4bb] vector and scalars

2010-10-16 Thread Douglas Theobald
On Oct 16, 2010, at 3:32 PM, Ian Tickle wrote:

 Hi Tim
 
 As I indicated previously, the Fortran code was only meant to define
 my statement of the problem so that there can be absolutely no
 ambiguity as to the question: the answer to the problem (if it exists)
 has nothing whatsoever to do with the programming language used and I
 don't see how it can be constrained in any way by its semantics, since
 I also provided the questions in algebraic form.  You can't get more
 'natural' than that!
 
 The answer may be provided either algebraically (which would actually
 be preferable) or in any programming language of your choice: I am
 certainly not forcing anyone to code in Fortran if they don't want to.
 If you're saying that I'm unable to solve the problem just because
 I'm programming in Fortran, then you don't understand how algorithmic
 problem solving works: first a working solution must be obtained
 algebraically, then algorithmically, and only then programmatically.
 The first two steps are always the hardest, the last is almost always
 relatively trivial, and the programming language chosen is a matter of
 personal preference.  It cannot constrain the solution, since that
 must already have been completely defined by the first two steps.  I
 have not yet come across a purely algebraic problem which possesses
 semantics that couldn't be expressed in Fortran.  That doesn't mean
 there aren't any, it's just that none of the problems that I've yet
 come across absolutely require programmng in another language: until
 they do I'm happy to stick with Fortran.
 
 Just to be clear again, the statement of the problem, expressed
 entirely algebraically, is:
 
 1) To express F.G using vector notation only, where F and G are
 complex vectors of arbitrary dimension, and
 
 2) Same with F1/G1 where F1 and G1 are complex numbers (e.g.
 individual elements of the above complex vectors).

Ian -- Fortran itself actually treats complex numbers internally as
vectors, so clearly there is a solution to your problem.  In any case,
you can easily program, in any language you want, F.G or F1/G1 using
vector arithmetic.  You cannot, however, confine yourself to the common
standard dot and cross product.  But, contrary to what you are
apparently implying, those are not the only two possible vector
multiplication operations that can be formally defined for vectors. As a
simple counterexample, you can do element-wise vector multiplication and
division. There is also the well-known geometric product (from Clifford 
algebra),
the vector perp dot product, the vector direct product, and the wedge (exterior)
product.  The geometric product is esp. relevant here, because in 2D it is the 
same operation as multiplying two complex numbers
(see http://en.wikipedia.org/wiki/Geometric_algebra#Complex_numbers ).

These pages may be helpful for other examples:

http://www.euclideanspace.com/maths/algebra/vectors/vecAlgebra/powers/index.htm
http://www.euclideanspace.com/maths/algebra/vectors/vecAlgebra/exponent/index.htm

As I said earlier, if an entity fulfills the axioms of a vector space, then
they are vectors.  

http://en.wikipedia.org/wiki/Vector_space#Definition

Complex numbers fulfill these axioms.  On the other hand, there is no
requirement for vectors to have valid dot and cross products defined.
Euclidean vectors do, but an object that lacks those operations is not thereby 
disqualified from being a vector. 
Complex numbers have other operations defined for them, but again that
does not mean that we cannot consider them as vectors in two dimensions.
In fact, it is common in mathematics to consider complex numbers as 2x2
*matrices*, in which the matrix corresponding to i is an orthogonal 90
degree rotation matrix.  
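
A quick numerical check of that last point (Python with numpy, using the
standard mapping a + bi -> [[a, -b], [b, a]]): multiplying the matrices
reproduces complex multiplication exactly.

    import numpy as np

    def as_matrix(z):
        # a + bi  ->  [[a, -b], [b, a]]; i itself maps to a 90-degree rotation
        return np.array([[z.real, -z.imag],
                         [z.imag,  z.real]])

    z1, z2 = 2 + 3j, -1 + 4j
    print(np.allclose(as_matrix(z1) @ as_matrix(z2), as_matrix(z1 * z2)))  # True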

Cheers,

Douglas

 You see, absolutely no Fortran!
 
 Cheers
 
 -- Ian
 
 On Sat, Oct 16, 2010 at 8:50 AM, Tim Gruene t...@shelx.uni-ac.gwdg.de wrote:
 Dear Ian,
 
 maybe you should switch from Fortran to C++. Then you would not be forced to
 make nature follow the semantics of your programming language but can adjust
 your code to the problem you are tackling.
 The question you post would nicely fit into a first year's course on C++ 
 (and of
 course can all be answered very elegantly).
 
 Cheers, Tim
 
 On Fri, Oct 15, 2010 at 11:55:54PM +0100, Ian Tickle wrote:
 On Fri, Oct 15, 2010 at 8:11 PM, Douglas Theobald
 dtheob...@brandeis.edu wrote:
 Vectors are not only three-dimensional, nor only Euclidean -- vectors can 
 be
 defined for any number of arbitrary dimensions.  Your initial comment
 referred to complex numbers, for instance, which are 2D vectors (not 1-D).
  Obviously scalars are not 3-vectors, they are 1-vectors.  And contrary to
 your earlier assertion, you can always represent complex numbers as vectors
 (in fortran, C, on paper, or whatever), and it is possible to define many
 different valid types of multiplication, exponentiation, logarithms, 
 powers,
 etc. for vectors (and matrices as well).
 
 I didn't say that vectors are only 3D or only

Re: [ccp4bb] vector and scalars

2010-10-15 Thread Douglas Theobald
As usual, the Omniscient Wikipedia does a pretty good job of giving the 
standard mathematical definition of a vector:

http://en.wikipedia.org/wiki/Vector_space#Definition

If the thing fulfills the axioms, it's a vector.  Complex numbers do, as well 
as scalars.  
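
A quick numerical spot check (Python with numpy; verifying a few of the axioms
numerically is of course illustration rather than proof) that complex numbers
behave as a vector space over the real scalars:

    import numpy as np

    rng = np.random.default_rng(3)
    a, b = rng.normal(size=2)                      # real scalars
    u, v = complex(1.0, 2.0), complex(-3.0, 0.5)   # "vectors" in C over R

    print(np.isclose(a * (u + v), a * u + a * v))  # scalar distributes over +
    print(np.isclose((a + b) * u, a * u + b * u))  # vector distributes over +
    print(np.isclose((a * b) * u, a * (b * u)))    # compatibility of scaling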

On Oct 15, 2010, at 8:56 AM, David Schuller wrote:

 On 10/14/10 11:22, Ed Pozharski wrote:
 Again, definitions are a matter of choice
 There is no correct definition of anything.
 
 Definitions are a matter of community choice, not personal choice; i.e. a 
 matter of convention. If you come across a short squat animal with split 
 hooves rooting through the mud and choose to define it as a giraffe, you 
 will find yourself ignored and cut off from the larger community which 
 chooses to define it as a pig.
 
 -- 
 ===
 All Things Serve the Beam
 ===
   David J. Schuller
   modern man in a post-modern world
   MacCHESS, Cornell University
   schul...@cornell.edu



^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`
Douglas L. Theobald
Assistant Professor
Department of Biochemistry
Mailstop 009
415 South St
Brandeis University
Waltham, MA  02454-9110

dtheob...@brandeis.edu
http://theobald.brandeis.edu/

Office: +1 (781) 736-2303
Fax:+1 (781) 736-2349

 ^\
   /`  /^.  / /\
  / / /`/  / . /`
 / /  '   '
'








Re: [ccp4bb] vector and scalars

2010-10-15 Thread Douglas Theobald
On Oct 15, 2010, at 11:37 AM, Ganesh Natrajan wrote:

 Douglas,
 
 The elements of a 'vector space' are not 'vectors' in the physical
 sense. 

And there you make Ed's point -- some people are using the general vector 
definition, others are using the more restricted Euclidean definition.  

The elements of a general vector space certainly can be physical, by any normal 
sense of the term.  And note that physical 3D space is not Euclidean, in any 
case.

 The correct Wikipedia page is this one
 
 http://en.wikipedia.org/wiki/Euclidean_vector
 
 
 Ganesh
 
 
 
 On Fri, 15 Oct 2010 11:20:04 -0400, Douglas Theobald
 dtheob...@brandeis.edu wrote:
 As usual, the Omniscient Wikipedia does a pretty good job of giving
 the standard mathematical definition of a vector:
 
 http://en.wikipedia.org/wiki/Vector_space#Definition
 
 If the thing fulfills the axioms, it's a vector.  Complex numbers do,
 as well as scalars.
 
 On Oct 15, 2010, at 8:56 AM, David Schuller wrote:
 
 On 10/14/10 11:22, Ed Pozharski wrote:
 Again, definitions are a matter of choice
 There is no correct definition of anything.
 
 Definitions are a matter of community choice, not personal choice; i.e. a 
 matter of convention. If you come across a short squat animal with split 
 hooves rooting through the mud and choose to define it as a giraffe, you 
 will find yourself ignored and cut off from the larger community which 
 chooses to define it as a pig.
 
 --
 ===
 All Things Serve the Beam
 ===
  David J. Schuller
  modern man in a post-modern world
  MacCHESS, Cornell University
  schul...@cornell.edu
 
 
 
 





Re: [ccp4bb] vector and scalars

2010-10-15 Thread Douglas Theobald
On Oct 15, 2010, at 12:14 PM, William G. Scott wrote:

 As usual, the Omniscient Wikipedia does a pretty good job of giving the 
 standard mathematical definition of a vector:
 
 http://en.wikipedia.org/wiki/Vector_space#Definition
 
 If the thing fulfills the axioms, it's a vector.  Complex numbers do, as 
 well as scalars.  
 
 
 It is a bit more complicated, unfortunately.  cf:

Don't you mean, it's a bit more _complex_? :)

 
 http://en.wikipedia.org/wiki/Complex_number#The_complex_plane
 
 http://en.wikipedia.org/wiki/Complex_number#Real_vector_space





Re: [ccp4bb] off topic: multiple structural sequence alignment

2010-01-12 Thread Douglas Theobald
Both  MUSTANG and MATT are good choices:

http://www.cs.mu.oz.au/~arun/mustang/

http://groups.csail.mit.edu/cb/matt/

On Jan 12, 2010, at 7:17 AM, Ronnie Berntsson wrote:

 Dear all,
 
 A bit off the topic question perhaps. 
 I am trying to find a program which can do multiple structural sequence 
 alignments. What I would like is a program which can take as input PDB codes 
 (or files), and which will output a multiple sequence alignment in FASTA 
 format with the full sequences of the supplied proteins intact. Preferably 
 said server/program should be able to handle at least 20 input pdbs at once. 
 I've been looking around, but have so far failed to find a program which does 
 this. If anyone knows of a program or server which could handle this, I would 
 be very grateful.
 
 Cheers,
 Ronnie Berntsson


Re: [ccp4bb] FW: pdb-l: Retraction of 12 Structures

2009-12-16 Thread Douglas Theobald
On Dec 16, 2009, at 7:40 AM, Anastassis Perrakis wrote:

 How very correct. And if anyone is doubt, remember the fiasco of the 'memory 
 of water', published in Nature.
 To borrow the title of DVD's talks, Just because its in Nature, it does not 
 mean its true.

Or, as one of my colleagues is known to say: It's in Nature, and it's even 
right!

 More specifically, we are seeing peer review at work. I that the
 implementation of peer review as part of the publication has perhaps
 lead us to forget that peer review is a longer term, ongoing process,
 conducted by the whole scientific community.
 
 A.


Re: [ccp4bb] units of the B factor

2009-11-23 Thread Douglas Theobald
Argument from authority, from the omniscient Wikipedia:

http://en.wikipedia.org/wiki/Radian

Although the radian is a unit of measure, it is a dimensionless quantity.

The radian is a unit of plane angle, equal to 180/pi (or 360/(2 pi)) degrees, 
or about 57.2958 degrees. It is the standard unit of angular measurement in 
all areas of mathematics beyond the elementary level.

… the radian is now considered an SI derived unit.

On Nov 23, 2009, at 1:31 PM, Ian Tickle wrote:

 James, I think you misunderstood, no-one is suggesting that we can do
 without the degree (minute, second, grad, ...), since these conversion
 units have considerable practical value.  Only the radian (and
 steradian) are technically redundant, and as Marc suggested we would
 probably be better off without them!
 
 Cheers
 
 -- Ian
 
 -Original Message-
 From: owner-ccp...@jiscmail.ac.uk 
 [mailto:owner-ccp...@jiscmail.ac.uk] On Behalf Of James Holton
 Sent: 23 November 2009 16:35
 To: CCP4BB@jiscmail.ac.uk
 Subject: Re: [ccp4bb] units of the B factor
 
 Just because something is dimensionless does not mean it is 
 unit-less.  
 The radian and the degree are very good examples of this.  
 Remember, the 
 word unit means one, and it is the quantity of something that we 
 give the value 1.0.  Things can only be measured relative 
 to something 
 else, and so without defining for the relevant unit, be it 
 a long-hand 
 description or a convenient abbreviation, a number by itself is not 
 useful.  It may have meaning in the metaphysical sense, but its not 
 going to help me solve my structure.
 
 A world without units is all well and good for theoreticians 
 who never 
 have to measure anything, but for those of us who do need to 
 know if the 
 angle is 1 degree or 1 radian, units are absolutely required.
 
 -James Holton
 MAD Scientist
 
 Artem Evdokimov wrote:
 The angle value and the associated basic trigonometric 
 functions (sin, cos,
 tan) are derived from a ratio of two lengths* and therefore are
 dimensionless. 
 
 It's trivial but important to mention that there is no 
 absolute requirement
 of units of any kind whatsoever with respect to angles or 
 to the three basic
 trigonometric functions. All the commonly used units come 
 from (arbitrary)
 scaling constants that in turn are derived purely from convenience -
 specific calculations are conveniently carried out using 
 specific units (be
 they radians, points, seconds, grads, brads, or papaya 
 seeds) however the
 units themselves are there only for our convenience (unlike 
 the absolutely
 required units of mass, length, time etc.). 
 
 Artem
 
 * angle - the ratio of the arc length to radius of the arc 
 necessary to
 bring the two rays forming the angle together; trig 
 functions - the ratio of
 the appropriate sides of a right triangle
 
 -Original Message-
 From: CCP4 bulletin board [mailto:ccp...@jiscmail.ac.uk] On 
 Behalf Of Ian
 Tickle
 Sent: Sunday, November 22, 2009 10:57 AM
 To: CCP4BB@JISCMAIL.AC.UK
 Subject: Re: [ccp4bb] units of the B factor
 
 Back to the original problem: what are the units of B and
 
 u_x^2?  I haven't been able to work that out.  The first
 whack is to say that B occurs in the term
 
 Exp( -B (Sin(theta)/lambda)^2)

 and we've learned that the unit of Sin(theta)/lambda is 1/Angstrom
 and the argument of Exp, like Sin, must be radian.  This means
 that the units of B must be A^2 radian.  Since B = 8 Pi^2 u_x^2
 the units of 8 Pi^2 u_x^2 must also be A^2 radian, but the
 units of u_x^2 are determined by the units of 8 Pi^2.  I
 can't figure out the units of that without understanding the
 defining equation, which is in the OPDXr somewhere.  I suspect
 there are additional, hidden, units in that definition.  The
 basic definition would start with the deviation of scattering
 points from the Miller planes and those deviations are probably
 defined in cycle or radian and later converted to Angstrom so
 there are conversion factors present from the beginning.
 
I'm sure that if the MS sits down with the OPDXr and follows
 all these units through he will uncover the units of B, 8 Pi^2,
 and u_x^2 and the mystery will be solved.  If he doesn't do
 it, I'll have to sit down with the book myself, and that will
 make my head hurt.
 
 
 Hi Dale
 
 A nice entertaining read for a Sunday afternoon, but I think you can
 only get so far with this argument and then it breaks down, 
 as evidenced
 by the fact that eventually you got stuck!  I think the 
 problem arises
 in your assertion that the argument of 'exp' must be in units of
 radians.  IMO it can also be in units of radians^2 (or 
 radians^n where n
 is any unitless number, integer or real, including zero for that
 matter!) - and this seems to be precisely what happens 
 here.  Having a
 function whose argument can apparently have any one of an infinite
 number of units is somewhat of an embarrassment! - of 
 course that must
mean that the argument actually has no units at all.
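
One compact way to see why the argument of exp (or sin) must reduce to a pure 
number — a standard series argument, sketched here rather than taken from the 
original post:

\[
e^{x} \;=\; \sum_{n=0}^{\infty} \frac{x^{n}}{n!} \;=\; 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots
\]

The terms 1, x, x^2, ... can only be summed if they all carry the same 
dimension, which forces x — here x = -B (sin(theta)/lambda)^2 — to be 
dimensionless; whether one then attaches the label radian (or radian^n) to it 
is exactly the naming question debated above.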

Re: [ccp4bb] units of the B factor

2009-11-23 Thread Douglas Theobald
I agree that the official SI documentation has priority, but as I read it there 
is no discrepancy between it and Wikipedia.  The official SI position (and that 
of NIST and IUPAC) is that the radian is a dimensionless unit (i.e., a unit of 
dimension 1).

Quoting at length from the SI brochure:

2.2.3 Units for dimensionless quantities, also called quantities of dimension 
one

Certain quantities are defined as the ratio of two quantities of the same kind, 
and are thus dimensionless, or have a dimension that may be expressed by the 
number one. The coherent SI unit of all such dimensionless quantities, or 
quantities of dimension one, is the number one, since the unit must be the 
ratio of two identical SI units. The values of all such quantities are simply 
expressed as numbers, and the unit one is not explicitly shown. Examples of 
such quantities are refractive index, relative permeability, and friction 
factor. There are also some quantities that are defined as a more complex 
product of simpler quantities in such a way that the product is dimensionless. 
Examples include the 'characteristic numbers' like the Reynolds number Re = 
ρvl/η, where ρ is mass density, η is dynamic viscosity, v is speed, and l is 
length. For all these cases the unit may be considered as the number one, which 
is a dimensionless derived unit.

Another class of dimensionless quantities are numbers that represent a count, 
such as a number of molecules, degeneracy (number of energy levels), and 
partition function in statistical thermodynamics (number of thermally 
accessible states). All of these counting quantities are also described as 
being dimensionless, or of dimension one, and are taken to have the SI unit 
one, although the unit of counting quantities cannot be described as a derived 
unit expressed in terms of the base units of the SI. For such quantities, the 
unit one may instead be regarded as a further base unit.

In a few cases, however, a special name is given to the unit one, in order to 
facilitate the identification of the quantity involved. This is the case for 
the radian and the steradian. The radian and steradian have been identified by 
the CGPM as special names for the coherent derived unit one, to be used to 
express values of plane angle and solid angle, respectively, and are therefore 
included in Table 3.

The radian and steradian are special names for the number one that may be used 
to convey information about the quantity concerned. In practice the symbols rad 
and sr are used where appropriate, but the symbol for the derived unit one is 
generally omitted in specifying the values of dimensionless quantities.

pp 119-120, The International System of Units (SI). International Bureau of 
Weights and Measures (BIPM). 
http://www.bipm.org/utils/common/pdf/si_brochure_8_en.pdf

also see 

http://physics.nist.gov/cuu/Units/units.html
http://www.iupac.org/publications/books/gbook/green_book_2ed.pdf



On Nov 23, 2009, at 4:03 PM, marc.schi...@epfl.ch wrote:

 I would believe that the official SI documentation has precedence over 
 Wikipedia. In the SI brochure it is made quite clear that Radian is just 
 another symbol for the number one and that it may or may not be used, as is 
 convenient.
 
 Therefore, stating alpha = 15 (without anything else) is perfectly valid for 
 an angle.
 
 Marc
 
 
 
 Quoting Douglas Theobald dtheob...@brandeis.edu:
 
 Argument from authority, from the omniscient Wikipedia:
 
 http://en.wikipedia.org/wiki/Radian
 
 Although the radian is a unit of measure, it is a dimensionless quantity.
 
 The radian is a unit of plane angle, equal to 180/pi (or 360/(2 pi)) 
degrees, or about 57.2958 degrees. It is the standard unit of angular 
 measurement in all areas of mathematics beyond the elementary level.
 
 … the radian is now considered an SI derived unit.
 
 On Nov 23, 2009, at 1:31 PM, Ian Tickle wrote:
 
 James, I think you misunderstood, no-one is suggesting that we can do
 without the degree (minute, second, grad, ...), since these conversion
 units have considerable practical value.  Only the radian (and
 steradian) are technically redundant, and as Marc suggested we would
 probably be better off without them!
 
 Cheers
 
 -- Ian
 
 -Original Message-
 From: owner-ccp...@jiscmail.ac.uk
 [mailto:owner-ccp...@jiscmail.ac.uk] On Behalf Of James Holton
 Sent: 23 November 2009 16:35
 To: CCP4BB@jiscmail.ac.uk
 Subject: Re: [ccp4bb] units of the B factor
 
 Just because something is dimensionless does not mean it is
 unit-less.
 The radian and the degree are very good examples of this.
 Remember, the
 word unit means one, and it is the quantity of something that we
 give the value 1.0.  Things can only be measured relative
 to something
 else, and so without defining for the relevant unit, be it
 a long-hand
 description or a convenient abbreviation, a number by itself is not
useful.  It may have meaning in the metaphysical sense, but it's not
going to help me solve my structure.

Re: [ccp4bb] Rmerge - was molecular replacement with large cell

2009-07-15 Thread Douglas Theobald

James,

Graeme is right.  While <I> does indeed (approximately) follow a
Gaussian, |I-<I>| cannot.  The absolute value operator keeps it
positive (reflects the negative across the origin), and hence it is a
half Gaussian.  Its mean cannot be zero unless the variance is zero.
For standard normals (variance = 1), the mean of |I-<I>| is 0.798,
just as Graeme said.  You can do the integration.  So, the fact that
|I-<I>|/<I> is unstable at low I/sigma is *not* a consequence of the
peculiar divergent properties of a Cauchy (Lorentzian).  Rather, it's
a consequence of E(I) being zero.  And, like your calculator knows,
division by zero is undefined (or infinite, depending on your
proclivities).


Cheers,

Douglas
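
A minimal check of the 0.798 figure (plain Python, not from the original 
post; the analytic value is sqrt(2/pi)):

import math
import random

# Mean of |X| for a standard normal is sqrt(2/pi) ~ 0.7979 (the half-Gaussian mean)
print(math.sqrt(2.0 / math.pi))

# Quick Monte Carlo confirmation
random.seed(1)
n = 1_000_000
print(sum(abs(random.gauss(0.0, 1.0)) for _ in range(n)) / n)   # ~0.798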


On Jul 15, 2009, at 5:03 PM, James Holton wrote:

I tried plugging I/sigma = 0 into your formula below, but my  
calculator returned 


-James Holton
MAD Scientist

Graeme Winter wrote:

James,

I'm not sure you're completely right here - it's reasonably
straightforward to show that

Rmerge ~ 0.7979 / (I/sigma)

(Weiss & Hilgenfeld, J. Appl. Cryst. 1997) which can be verified from
e.g. the Scala log file, provided that the *unmerged* I/sigma is
considered:

http://www.ccp4.ac.uk/xia/rmerge.jpg

This example did not exhibit much radiation damage so it does
represent a best case.

For (unmerged) I/sigma < 1 the statistics do tend to become
unreliable, which I found was best demonstrated by inspection of the
E^4 plot - up to I/sigma ~ 1 it was ~ 2, but increased substantially
thereafter. This I had assumed represented the fact that the
intensities were drawn from a Gaussian distribution with low I/sigma
rather than the exponential (Wilson) distribution which would be
expected for intensities.

By repeatedly selecting small random subsets* of unique reflections in
the example data set and merging them separately, I found that the
error on the Rmerge above for the weakest reflections was about
0.05. Since this retains the same multiplicity and the mean value
converges on the complete data set statistics, I believe that the
comparisons are valid.

I guess I don't believe you :o)

Best,

Graeme



* CCTBX is awesome for this kind of thing!
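
For readers without CCTBX to hand, a toy stand-in for the experiment Graeme 
describes — plain Python with made-up numbers (the sigma, multiplicity and 
mean intensity are arbitrary), just to show the 0.7979/(I/sigma) relation 
emerging from simulated merging:

import math
import random

random.seed(0)
sigma, mult, mean_I = 1.0, 10, 20.0     # arbitrary noise level, multiplicity, Wilson mean
num = den = 0.0
i_over_sig = []
for _ in range(20000):                  # 20000 simulated unique reflections
    I_true = random.expovariate(1.0 / mean_I)            # Wilson (exponential) intensities
    obs = [random.gauss(I_true, sigma) for _ in range(mult)]
    I_bar = sum(obs) / mult
    num += sum(abs(I - I_bar) for I in obs)
    den += sum(obs)
    i_over_sig.extend(I / sigma for I in obs)
rmerge = num / den
pred = 0.7979 / (sum(i_over_sig) / len(i_over_sig))      # Weiss & Hilgenfeld estimate
print(rmerge, pred)   # close; rmerge comes out slightly lower because each
                      # observation is compared to its own finite-multiplicity mean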

2009/7/15 James Holton jmhol...@lbl.gov:

Actually, if I/sd < 3, Rmerge, Rpim, Rrim, etc. are all infinity.  Doesn't
Doesn't

matter what your redundancy is.

Don't believe me?  Try it.
The extreme case is I/sd = 0, and as long as there is some  
background (and,
let's face it, there always is), the observed spot intensity  
will be
equally likely to be positive or negative, with a (basically)  
Gaussian

distribution.
So, if you generate say, ten Gaussian-random numbers (centered on  
zero),
take their average value <I>, compute the average deviation from that
average |I-<I>|, and then divide |I-<I>|/<I>, you will get the Rmerge
expected for I/sd = 0 at a redundancy of 10.  Problem is, if you  
do this
again with a different random number seed, you will get a very  
different
Rmerge.  Even if you do it with a million different random number  
seeds and
compute the average Rmerge, you will always get wildly different  
values.
Some positive, some negative.  And it doesn't matter how many  
data points
you use to compute the Rmerge: averaging a million Rmerge values  
will give a

different answer than averaging a million and one.
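
A minimal sketch of the numerical experiment described in the paragraph above 
(plain Python; the seeds and counts are arbitrary, not anything from the 
thread):

import random
import statistics

def toy_rmerge(n_obs=10):
    """'Rmerge' for one reflection whose true intensity is zero (I/sd = 0)."""
    obs = [random.gauss(0.0, 1.0) for _ in range(n_obs)]
    mean_I = statistics.mean(obs)
    # ratio of a half-Gaussian-like numerator to a zero-mean Gaussian denominator
    return statistics.mean(abs(I - mean_I) for I in obs) / mean_I

for seed in range(5):
    random.seed(seed)
    print(seed, toy_rmerge())   # wildly different magnitudes and signs, seed to seed

Averaging many such values never settles down, which is the Cauchy-like 
behaviour described above.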

The reason for this numerical instability is because both <I> and |I-<I>|
follow a Gaussian distribution that is centered at zero, and the  
ratio of
two numbers like this has a Lorentzian distribution.  The  
Lorentzian looks a
lot like a Gaussian, but has much fatter tails.  Fat enough so  
that the
Lorentzian distribution has NO MEAN VALUE.  Seriously.  It is hard  
to
believe that the average value of something that is equally likely  
to be
positive or negative could be anything but zero, but for all  
practical
purposes you can never arrive at the average value of something  
with a
Lorentzian distribution.  At least not by taking finite samples.   
So, no

matter what the redundancy, you will always get a different Rmerge.

However, if <I> is not centered on zero (I/sd > 0), then the ratio
of the
two Gaussian-random numbers starts to look like a Gaussian itself,  
and this

distribution does have a mean value (Rmerge will be reproducible).
However, this does not happen all at once.  The tails start to  
shrink as
I/sd = 1, they are even smaller at I/sd = 2, and the distribution  
finally
loses all Lorentzian character when I/sd = 3.  Only then is  
Rmerge a

meaningful quantity.

So, perhaps our forefathers who first instituted the practice of  
a 3-sigma
cutoff for all intensities actually DID know what they were  
doing!  All R-
statistics (including Rcryst and Rfree) are unstable in this way  
for weak
data, but sometime in the early 1990s the practice of computing R- 
factors on
all data crept into the field.  I'm not saying we should not use  
all data,
maximum likelihood refinement uses sigmas properly and weak data  
are

Re: [ccp4bb] 3D modeling program

2008-12-07 Thread Douglas Theobald
- Dima Klenchin [EMAIL PROTECTED] wrote:

   But how do we establish phylogeny? - Based on simple similarity!
   (Structural/morphological in early days and largely on sequence
   identity today). It's clearly a circular logic:
 
 Hardly.  Two sequences can be similar and non-homologous at all
 levels. Also, two similar proteins can be homologous at one level but
 not at another. It's also possible for two proteins that have no
 detectable similarity above random sequences to be homologous.  Hence
 there is no circularity.
 
 Of course there is. Just how do you establish that the two are not
 homologous? - By finding that they don't belong to the same branch.
 And how do you decide what constitutes the same branch? - By looking
 at how similar things are!

But you have not established that there is circularity.  Logical
circularity means that you assume (as an essential premise) one of your
conclusions.  What exactly is the argument you are criticizing, and what
is the conclusion that is assumed?  When we conclude that two proteins
are homologous at some level, we have not assumed that they are
homologous at that level. Rather, the conclusion of homology is an
inference that uses similarity as relevant evidence.

   Plus, presumably all living things trace their ancestry to the
   primordial soup - so the presence or a lack of ancestry is just a
   matter of how deeply one is willing to look.
 
 This is also wrong.  Even if all organisms trace back to one common
 ancestor, that does not mean all proteins are homologous.  New
 protein coding genes can and do arise independently, and hence they
 are not homologous to any other existing proteins.
 
 Just how do they arise independently? Would that be independent of DNA
 sequence? And if not, then why can't shared ancestry of the DNA
 sequence fully qualify for homology?

Perhaps it could (although in some cases no), but still the new protein
would not be homologous to any other protein *at the protein level*.

 You also ignore the levels of homology concept -- just because two
 proteins are homologous at one level does not mean they are
 homologous at others.  For example, consider these three TIM barrel
 proteins: human IMPDH, hamster IMPDH, and chicken triose phosphate
 isomerase. They are all three homologous as TIM barrels. However,
 they are not all homologous as dehydrogenases -- only the human and
 hamster proteins are homologous as dehydrogenases.
 
 ... And all that is concluded based on sequence similarities [of other
 proteins/DNAs] to construct phylogenetic tree. So, ultimately,
 homology ~ similarity.

This is a non sequitur.  Yes, homology inference uses similarity as
evidence, but that does not mean homology is equivalent to similarity.
Two facile counterexamples to your claim: two proteins can be very
similar yet non-homologous, and two very dissimilar proteins can be
homologous.  Homology is thus not equivalent to similarity. QED.

 The generic concept of homology used to be used as a proof of
 evolution. Today, things seem to be reversed and evolution is being
 used to infer homology. A useful concept turned into a statement with
 little or no utility.

In fact quite the opposite is true.  Before evolutionary theory,
homology was a vacuous, mysterious concept with no utility.  It was
simply the descriptive observation that similar structures could have
different functions.  Now we know why that is the case.  You have
already pointed out that we have redefined homology (evolutionary
homology is not the same as generic, pre-evolutionary homology), and
this fact proves that the logic is non-circular: we assume generic
homology and conclude evolutionary homology.  This could only be
circular if the two concepts were identical, which you admit they are
not.  Your argument founders on an equivocation.

Cheers,

Douglas


Re: [ccp4bb] 3D modeling program

2008-12-06 Thread Douglas Theobald
- Dima Klenchin [EMAIL PROTECTED] wrote:

But how do we establish phylogeny? - Based on simple similarity!

This is a common misconception.  Modern phylogenetic
methods (Bayesian, maximum likelihood, and some distance-based) rely on
explicit models of molecular evolution, and the *patterns* of similarity
they create.  Even maximum parsimony, which is not model-based, does not
reconstruct phylogenies based on simple similarity.
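
As a concrete, if deliberately simple, example of what "model-based" means 
here — a Python sketch of the Jukes-Cantor correction (illustrative only; the 
model choice and numbers are not from the thread):

import math

def jukes_cantor_distance(p):
    """Expected substitutions per site, given an observed fraction p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for p in (0.05, 0.25, 0.50, 0.70):
    print(p, round(jukes_cantor_distance(p), 3))
# 0.05 -> 0.052, 0.25 -> 0.304, 0.5 -> 0.824, 0.7 -> 2.031: equal steps in raw
# dissimilarity map onto very unequal evolutionary distances, so the tree is
# built from modelled divergence, not from simple percent similarity.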

 ah! the old rhetorical trick of changing the problem or question a
 posteriori! all i pointed out was that things can't be 25%
 homologous

 Well, you were right that in today's definition things can't be. But
 you seem to be missing my point that today's definition is essentially
 meaningless (relies on circular logic and has no epistemologic value)
 and that nothing would be lost if the term reverted to its generic
 usage, similar. There would still be a question to be asked similar
 for what reason? - same question that is presumed to be answered
 whenever one invokes phylogeny-based homology.

How does this make any sense?  Two proteins can have certain
similarities in sequence (or structure) due to either convergence or
homology.  That is the answer to your question of similar for what
reason, and hence you have just shown that similarity is not the same
as homology, and that homology is not meaningless.

 i'm glad your opinion is humble here, because it has much to be
 humble about :-) do you really think that property (e.g., structure
 and function) prediction is not useful? and i can't even begin to
 understand how you can think that 'homology' in its present-day
 meaning is a pre-darwinian concept.

 Homology is a pre-Darwinian concept that was *redefined*
 post-Darwin. That's what I wrote.

 okay, so can we all agree now that we won't be saying and writing
 things like the two proteins are X% homologous anymore from now on?

 IMHO, it truly does not matter if we do or do not as long as we
 understand each other.

You are hard to understand if you say that two proteins are 25%
homologous.  Do you mean that one domain, out of four, is homologous
between the proteins?  That is the only sense in which that could be
construed as correct.

 Like I wrote in the original reply, paying too much attention to
 definitions of fuzzy abstract concepts is not worth it.

The homology concept is often misunderstood, that is true.  But there
are still blatantly incorrect uses, and substituting 25% homologous
for 25% similar is unequivocally wrong.

An important point to note is that homology must be qualified.  There
are levels of homology, and a structure can be homologous at one level
but not at another.  The classic example is bird and bat wings.  They
are homologous as vertebrate forelimbs, but not as wings.  


Re: [ccp4bb] 3D modeling program

2008-12-06 Thread Douglas Theobald
- Anastassis Perrakis [EMAIL PROTECTED] wrote:

 I think we are getting a bit too philosophical on a matter which is  
 mainly terminology .

 1. To quantify how similar two proteins are, one should best refer to
 'percent identity'. That's clear, correct and unambiguous.
 2. One can also refer to similarity. In that case it should be  
 clarified what is considered to be similar, mainly which comparison  
 matrix was used to quantify the similarity.
 3. Homology means common evolutionary origin. One understanding is  
 that homology refers to the genome of 'LUCA', the hypothetical last  
  universal common ancestor. I am not an evolutionary biologist, but I
  would clearly disagree that homology is a leftover pre-Darwinian term.
  
 The very notion of homology is only meaningful in the context of  
 evolution.

 Thus, to me:

 1. These proteins are 56% identical is clear.

Even this is unclear without qualification.  Identity is always determined
by alignment, and you can get different %ID by using different matrices.

 2. These proteins are 62% similar is unclear.
 3. These proteins are 62% similar using the Dayhoff-50 matrix is
 Ok.
 4. These proteins are homologous is clear, but can be subjective as
 to what homology is.
 5. These proteins are 32% homologous is simply wrong.

 Sorry for the non-crystallographic late evening blabber.

 A.

 On 6 Dec 2008, at 21:09, Dima Klenchin wrote:

  Having a generic dictionary definition is nice and dandy. However,
  in the present context, the term 'homology' has a much more
  specific meaning: it pertains to the having (or not) of a common
  ancestor. Thus, it is a binary concept. (*)
 
  But how do we establish phylogeny? - Based on simple similarity!
  (Structural/morphological in early days and largely on sequence
  identity today). It's clearly a circular logic: Lets not use
  generic definition; instead, lets use a specialized definition; and
  lets not notice that the specialized definition wholly depends on a
  system that is built using the generic definition to begin with.
 
  Plus, presumably all living things trace their ancestry to the
  primordial soup - so the presence or a lack of ancestry is just a
  matter of how deeply one is willing to look. In other words, it's
  nice and dandy to have theoretical binary concept but in practice it
  is just as fuzzy as anything else.
 
  IMHO, the phylogenetic concept of homology in biology does not buy
  you much of anything useful. It seems to be just a leftover from pre-
  Darwinian days - redefined since but still lacking solid foundation.
 
  Dima


Re: [ccp4bb] 3D modeling program

2008-12-06 Thread Douglas Theobald
- Dima Klenchin [EMAIL PROTECTED] wrote:

 Having a generic dictionary definition is nice and dandy. However, in
 the present context, the term 'homology' has a much more specific
 meaning: it pertains to the having (or not) of a common ancestor.
 Thus, it is a binary concept. (*)
 
 But how do we establish phylogeny? - Based on simple similarity!
 (Structural/morphological in early days and largely on sequence
 identity today). It's clearly a circular logic: 

Hardly.  Two sequences can be similar and non-homologous at all levels.
Also, two similar proteins can be homologous at one level but not at
another. It's also possible for two proteins that have no detectable
similarity above random sequences to be homologous.  Hence there is
no circularity.  

 Lets not use generic definition; instead, lets use a specialized
 definition; and lets not notice that the specialized definition wholly
 depends on a system that is built using the generic definition to
 begin with.
 
 Plus, presumably all living things trace their ancestry to the
 primordial soup - so the presence or a lack of ancestry is just a
 matter of how deeply one is willing to look. 

This is also wrong.  Even if all organisms trace back to one common
ancestor, that does not mean all proteins are homologous.  New protein
coding genes can and do arise independently, and hence they are not
homologous to any other existing proteins.  You also ignore the levels
of homology concept -- just because two proteins are homologous at one
level does not mean they are homologous at others.  For example,
consider these three TIM barrel proteins: human IMPDH, hamster IMPDH,
and chicken triose phosphate isomerase. They are all three homologous as
TIM barrels.  However, they are not all homologous as dehydrogenases --
only the human and hamster proteins are homologous as dehydrogenases.

 In other words, it's nice and dandy to have theoretical binary concept
 but in practice it is just as fuzzy as anything else.
 
 IMHO, the phylogenetic concept of homology in biology does not buy you
 much of anything useful. It seems to be just a leftover from
 pre-Darwinian days - redefined since but still lacking solid
 foundation.
 
 Dima