Re: [ccp4bb] [phenixbb] C-beta RMSD
THESEUS can do it, and it comes bundled with ccp4 so definitely on-topic. If you want RMSD of “equivalent” amino acids, you must tell THESEUS which residues are equivalent with a sequence alignment. Then use the -I option to get the RMSD (and other stats) of the pdb files in their current orientation. E.g., theseus -A cytc.aln -I d1cih__.pdb d1crj__.pdb Cheers, Douglas On Jun 26, 2015, at 3:52 AM, Kaushik Hatti hskaus...@gmail.com wrote: Hi, Is there a tool which can calculate C-beta RMSD for equivalent amino acids of homologous structures, post C-alpha superposition? Sorry if it's off topic, Thanks, Kaushik -- Stupidity is everyone’s birthright. However, only the learned exercise it! --Kaushik (28Oct2014)
Re: [ccp4bb] [phenixbb] Alignment of multiple structures
THESEUS should be able to do it rather easily. You can email me offlist if you need some guidance. On Jun 1, 2015, at 3:53 PM, jens j birktoft birkt...@nyu.edu wrote: I apologize if this question has been asked before but I still need help finding an answer to the following. I am looking for a program/web-server that will calculate the superposition of multiple structures (non-protein!) Thanks -- +++ Jens J. Birktoft Structural DNA Nanotechnology Department of Chemistry New York University e-mail: jens.kn...@gmail.com; Phone: 212-749-5057 very slow-mail: 350 Central Park West, Suite 9F, New York, NY 10025 +++
Re: [ccp4bb] [RANT] Reject Papers describing non-open source software
On May 12, 2015, at 3:19 PM, Robbie Joosten robbie_joos...@hotmail.com wrote: I strongly disagree with rejecting a paper for any other reasons than scientific ones. I agree, but … one of the foundations of science is independent replicability and verifiability. In practice, for me to be able to replicate and verify your computational analysis and results, I will need to be able to see your source code, compile it myself, and potentially modify it. These requirements in effect necessitate some sort of open source model, in the broadest sense of the term. To take one of your examples, the Ms-RSL license — I can’t effectively replicate and verify your results if I’m legally prohibited from compiling and modifying your source code, so the Ms-RSL is out. A paper describing software should properly describe the algorithms to ensure the reproducibility. *Should*. In practice, we all know (those programmers among us do, anyway) that descriptions of source code do not suffice. The source should be available for inspection to ensure the program does what was claimed, for all I care this can be under the Ms-RSL license or just under good-old copyright. The program should preferably be available free for academic users, but if the paper is good you should be able to re-implement the tool if it is too expensive or doesn't exactly do what you want, so it isn't entirely necessary. Making the software open source (in an OSS sense) does not solve any problems that a good description of the algorithms doesn't do well already. This is just wildly wrong. It’s basically impossible to ensure and verify that a “good” description of the algorithm actually corresponds to the source code without seeing, using, and modifying the source. To take an experimental analogy — my lab has endured several cases where we read a “good” published description of the subcloning and sequencing of some vector, only to find that the detailed published description is wrong when we are given the chance to analyze the vector ourselves. It happens all the time, and computer code is no different in this respect. OSS does not guarantee long-term availability, a paper will likely outlive the software repository. OSS licenses (though not the BSD license) can be so restrictive that you end up having to re-implement the algorithms anyway. So not having an OSS license should not be a reason to reject the paper about the software. Cheers, Robbie -Original Message- From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of James Stroud Sent: Tuesday, May 12, 2015 20:40 To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] [RANT] Reject Papers describing non-open source software On May 12, 2015, at 12:29 PM, Roger Rowlett rrowl...@colgate.edu wrote: Was the research publicly funded? If you receive funds from NSF, for example, you are expected to share and make widely available and usable software and inventions created under a grant (section VI.D.4. of the Award and Administration Guide). I don't know how enforceable that clause is, however. The funding shouldn't matter. I suggest that a publication that has the purpose of describing non-open source software should be summarily rejected by referees. In other words, the power is in our hands, not the NSF's.
Re: [ccp4bb] ctruncate bug?
Hi Randy, So I've been playing around with equations myself, and I have some alternative results. As I understand your Mathematica stuff, you are using the data model ip = ij + ib', where ip is the measured peak (before any background correction), and ij is a random sample from the true intensity j. Here ib is the measured background, whereas ib' is the background absorbed into ip. Both ib and ib' are a random sample from background jb. Again, only ip and ib are observed; ij and ib' are hidden variables. Now let me recap your treatment of that model (hopefully I get this right). You assume Poisson distributions for ip, ij, ib, and ib', and find the joint probability of observed ip and ib given j and jb, p(ip,ib|j,jb). You can consider ip and ib as statistically independent, since ip depends on ib', not ib. You then marginalize over jb (the true background intensity) using a flat uninformative prior, giving p(ip,ib|j). You find that p(ip,ib|j) is similar to FW's p(ip-ib|j, sdj), where sdj=sqrt(ip+ib). Some sort of scaling is necessary, since in practice ib and ip are counted from different numbers of pixels. You find that, for roughly equal scaling, the Poisson version is similar to FW's Gaussian approximation for even moderate counts. However, in practice, we measure the background from a much larger area than the spot. For example, in the mosflm window I have open now, the background area is 20 times the spot area, for high res, low SNR spots. Similarly, in xds the background-to-spot ratio, in terms of pixel #, is about 10 on average and at least 5 for the great majority of spots. Therefore, we typically know the value of jb to a much better precision than what we can get from ip (which is essentially an estimate of j+jb). If the relative sd of the background is about 2 or 3 times less than that of the spot ip, we can approximate the background estimate of jb as a constant (ie, ignore the uncertainty in its value). This will be valid if the total area used for the background measurement is roughly 5 times the area of the spot (even less for negative peaks). So what we can do is estimate jb using ib, and then find the conditional distribution of j given ip and jb. Using your notation, this distribution is given by:

p(j|ip,jb) = exp(-(jb+j)) (jb+j)^ip / Gamma(ip+1, jb)

where Gamma(.,.) is the upper incomplete gamma function. The moments of this distribution have nice analytical forms (well, at least as nice as FW's). Here's a table comparing the FW estimates to this Poisson treatment, using Randy's ip and jb values, plus some others:

   ip     jb   Exp[j]_fw  SD[j]_fw     h   Exp[j]_dt  SD[j]_dt  %diff
 ----   ----   ---------  --------  ----   ---------  --------  -----
   55     45      11.3       6.3     1.3      11.9       6.8      5.3
   45     55       3.0       2.6    -1.5       3.7       3.3      5.4
   35     65       1.1       1.1    -5.1       2.0       2.0     86
    6     10       1.0       0.91   -1.6       1.8       1.7     80
    1      3       0.37      0.34   -2.0       1.3       1.2    240
    4     12       0.45      0.43   -4.0       1.4       1.3    210
  100    100       8.0       6.0     0         8.6       6.6      7.4
   85    100       3.9       3.4    -1.6       4.7       4.2     20
   75    100       2.5       2.4    -2.9       3.4       3.2     35
  500    500      17.8      13.5     0        18.4      14.0      3.3
  440    500       6.2       5.8    -2.9       7.0       6.6     14
 1000   1000      25.2      19.1     0        25.8      21        2.3
  920   1000       9.4       8.8    -2.6      10.3       9.5      9.1
  940   1000      11.6      10.5    -2.0      12.4      11        7

In this table I've used sdj=sqrt(ip) for FW, since I'm ignoring the uncertainty in jb --- Randy used sqrt(ip+ib). Here h = (ip-jb)/sdj and %diff = (Exp[j]_dt - Exp[j]_fw)/Exp[j]_fw. Here jb is the # background counts normalized to have the same pixel area as ip. Whether these would be considered important differences, I'm not sure. The differences are greatest when ip < jb (that is, for negative intensities).
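A minimal numerical sketch (an editorial illustration, not part of the original message) of the moments of p(j|ip,jb) as defined above, written in terms of SciPy's regularized upper incomplete gamma function; the function name dt_moments and the example (ip, jb) pairs are assumed, and the output should match the Exp[j]_dt and SD[j]_dt columns of the table.

```python
from scipy.special import gammaincc  # regularized upper incomplete gamma Q(a, x) = Gamma(a, x)/Gamma(a)

def dt_moments(ip, jb):
    """Posterior mean and SD of j under p(j|ip,jb), treating the background mean jb as known."""
    q1 = gammaincc(ip + 1, jb)
    q2 = gammaincc(ip + 2, jb)
    q3 = gammaincc(ip + 3, jb)
    m1 = (ip + 1) * q2 / q1                   # E[jb + j] = Gamma(ip+2, jb) / Gamma(ip+1, jb)
    m2 = (ip + 2) * (ip + 1) * q3 / q1        # E[(jb + j)^2]
    return m1 - jb, (m2 - m1 ** 2) ** 0.5     # E[j] and SD[j]

for ip, jb in [(55, 45), (35, 65), (100, 100), (440, 500)]:
    print(ip, jb, dt_moments(ip, jb))
```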
As an aside: It's easy to expand this to include the acentric Wilson prior: p(j|ip,jb,w) = exp(-(jb+j)(w+1)) (jb+j)^ip (w+1)^(ip+1) / Gamma(ip+1,jb(w+1)) where w = 1/sigma_w, sigma_w = the Wilson sigma. Again, the moments have analytical forms. On Jul 1, 2013, at 5:47 AM, Randy Read rj...@cam.ac.uk wrote: Hi, I've been following this discussion, and I was particularly interested by the suggestion that some information might be lost by turning the separate peak and background measurements into a single difference. I accept the point that there might be value in, e.g., TDS models that pay explicit attention to non-Bragg intensities, but this whole discussion started from the point of what estimates to use for diffracted Bragg intensities in processes such as molecular replacement, refinement, and map calculations. I thought I'd run this past the two of you, in case I've
Re: [ccp4bb] ctruncate bug?
On Jul 7, 2013, at 1:44 PM, Ian Tickle ianj...@gmail.com wrote: On 29 June 2013 01:13, Douglas Theobald dtheob...@brandeis.edu wrote: I admittedly don't understand TDS well. But I thought it was generally assumed that TDS contributes rather little to the conventional background measurement outside of the spot (so Stout and Jensen tells me :). So I was not even really considering TDS, which I see as a different problem from measuring background (am I mistaken here?). I thought the background we measure (in the area surrounding the spot) mostly came from diffuse solvent scatter, air scatter, loop scatter, etc. If so, then we can just consider Itrue = Ibragg + Itds, and worry about modeling the different components of Itrue at a different stage. And then it would make sense to think about blocking a reflection (say, with a minuscule, precisely positioned beam stop very near the crystal) and measuring the background in the spot where the reflection would hit. That background should be approximated pretty well by Iback, the background around the spot (especially if we move far enough away from the spot so that TDS is negligible there). Stout & Jensen would not be my first choice to learn about TDS! It's a textbook of small-molecule crystallography (I know, it was my main textbook during my doctorate on small-molecule structures), and small molecules are generally more highly ordered than macromolecules and therefore exhibit TDS on a much smaller scale (there are exceptions of course). I think what you are talking about is acoustic mode TDS (so-called because of its relationship with sound transmission through a crystal), which peaks under the Bragg spots and is therefore very hard to distinguish from it. The other two contributors to TDS that are often observed in MX are optic mode and Einstein model. TDS arises from correlated motions within the crystal, for acoustic mode it's correlated motions of whole unit cells within the lattice, for optic mode it's correlations of different parts of a unit cell (e.g. correlated domain motions in a protein), and for Einstein model it's correlations of the movement of electrons as they are carried along by vibrating atoms (an Einstein solid is a simple model of a crystal proposed by A. Einstein consisting of a collection of independent quantised harmonic-isotropic oscillators; I doubt he was aware of its relevance to TDS, that came later). Here's an example of TDS: http://people.cryst.bbk.ac.uk/~tickle/iucr99/tds2f.gif . The acoustic mode gives the haloes around the Bragg spots (but as I said mainly coincides with the spots), the optic mode gives the nebulous blobs, wisps and streaks that are uncorrelated with the Bragg spots (you can make out an inner ring of 14 blobs due to the 7-fold NCS), and the Einstein model gives the isotropic uniform greying increasing towards the outer edge (makes it look like the diffraction pattern has been projected onto a sphere). So I leave you to decide whether TDS contributes to the background! That's all very interesting --- do you have a good ref for TDS where I can read up on the theory/practice? My protein xtallography books say even less than S&J about TDS. Anyway, this appears to be a problem beyond the scope of this present discussion --- in an ideal world we'd be modeling all the forms of TDS, and Bragg diffraction, and comparing those predictions to the intensity pattern over the entire detector --- not just integrating near the reciprocal lattice points. 
Going on what you said above, it seems the acoustic component can't really be measured independently of the Bragg peak, while the optic and Einstein components can, or at least can be estimated pretty well from the intensity around the Bragg peak (which means we can treat it as background). In any case, I'm going to ignore the TDS complications for now. :) As for the blocking beam stop, every part of the crystal (or at least every part that's in the beam) contributes to every part of the diffraction pattern (i.e. Fourier transform). This means that your beam stop would have to mask the whole crystal - any small bit of the crystal left unmasked and exposed to the beam would give a complete diffraction pattern! That means you wouldn't see anything, not even the background! That's all true, but you can detect peaks independently of one another on a detector, so obviously there is some minimal distance away from a crystal where you could completely block any given reflection and nothing else. Clearly the reflection stop would have to be the size of the crystal (or at least the beam). You could leave a small hole in the centre for the direct beam and that would give you the air scatter contribution, but usually the air path is minimal anyway so that's only a very small contribution to the total background. But let's say by some magic you were able to measure only the background, say
Re: [ccp4bb] ctruncate bug?
On Jun 27, 2013, at 12:30 PM, Ian Tickle ianj...@gmail.com wrote: On 22 June 2013 19:39, Douglas Theobald dtheob...@brandeis.edu wrote: So I'm no detector expert by any means, but I have been assured by those who are that there are non-Poissonian sources of noise --- I believe mostly in the readout, when photon counts get amplified. Of course this will depend on the exact type of detector, maybe the newest have only Poisson noise. Sorry for delay in responding, I've been thinking about it. It's indeed possible that the older detectors had non-Poissonian noise as you say, but AFAIK all detectors return _unsigned_ integers (unless possibly the number is to be interpreted as a flag to indicate some error condition, but then obviously you wouldn't interpret it as a count). So whatever the detector AFAIK it's physically impossible for it to return a negative number that is to be interpreted as a photon count (of course the integration program may interpret the count as a _signed_ integer but that's purely a technical software issue). Just because the detectors spit out positive numbers (unsigned ints) does not mean that those values are Poisson distributed. As I understand it, the readout can introduce non-Poisson noise, which is usually modeled as Gaussian. I think we're all at least agreed that, whatever the true distribution of Ispot (and Iback) is, it's not in general Gaussian, except as an approximation in the limit of large Ispot and Iback (with the proviso that under this approximation Ispot & Iback can never be negative). Certainly the assumption (again AFAIK) has always been that var(count) = count and I think I'm right in saying that only a Poisson distribution has that property? I think you mean that the Poisson has the property that mean(x) = var(x) (and since the ML estimate of the mean = count, you get your equation). Many other distributions can approximate that (most of the binomial variants with small p). Also, the standard gamma distribution with scale parameter=1 has that exact property. No, it's just terminology. For you, Iobs is defined as Ispot-Iback, and that's fine. (As an aside, assuming the Poisson model, this Iobs will have a Skellam distribution, which can take negative values and asymptotically approaches a Gaussian.) The photons contributed to Ispot from Itrue will still be Poisson. Let's call them something besides Iobs, how about Ireal? Then, the Poisson model is Ispot = Ireal + Iback' where Ireal comes from a Poisson with mean Itrue, and Iback' comes from a Poisson with mean Iback_true. The same likelihood function follows, as well as the same points. You're correct that we can't directly estimate Iback', but I assume that Iback (the counts around the spot) come from the same Poisson with mean Iback_true (as usual). So I would say, sure, you have defined Iobs, and it has a Skellam distribution, but what, if anything, does that Iobs have to do with Itrue? My point still holds, that your Iobs is not a valid estimate of Itrue when Ispot < Iback. Iobs as an estimate of Itrue requires unphysical assumptions, namely that photon counts can be negative. It is impossible to derive Ispot-Iback as an estimate for Itrue (when Ispot < Iback) *unless* you make that unphysical assumption (like the Gaussian model). Please note that I have never claimed that Iobs = Ispot - Iback is to be interpreted as an estimate of Itrue, indeed quite the opposite: I agree completely that Iobs has little to do with Itrue when Iobs is negative. 
In fact I don't believe anyone else is claiming that Iobs is to be interpreted as an estimate of Itrue either, so maybe this is the source of the misunderstanding? Maybe it is, but that has its own problems. I imagine that most people who collect an X-ray dataset think that the intensities in their mtz are indeed estimates of the true intensities from their crystal. Seems like a reasonable thing to expect, especially since the fourier of our model is supposed to predict Itrue. If Iobs is not an estimate of Itrue, what exactly is its relevance to the structure inference problem? Maybe it only serves as a way-station on the road to the French-Wilson correction? As I understand it, not everyone uses ctruncate. Certainly for me Ispot - Iback is merely the difference between the two measurements, nothing more. Maybe if we called it something other than Iobs (say Idiff), or even avoided giving it a name altogether that would avoid any further confusion? Perhaps this whole discussion has been merely about terminology? I'm also puzzled as to your claim that Iback' is not Poisson. I don't think your QM argument is relevant, since we can imagine what we would have detected at the spot if we'd blocked the reflection, and that # of photon counts would be Poisson. That is precisely the conventional logic behind
Re: [ccp4bb] ctruncate bug?
Ian, I really do think we are almost saying the same thing. Let me try to clarify. You say that the Gaussian model is not the correct data model, and that the Poisson is correct. I more-or-less agree. If I were being pedantic (me?) I would say that the Poisson is *more* physically realistic than the Gaussian, and more realistic in a very important and relevant way --- but in truth the Poisson model does not account for other physical sources of error that arise from real crystals and real detectors, such as dark noise and read noise (that's why I would prefer a gamma distribution). I also agree that for x > 10 the Gaussian is a good approximation to the Poisson. I basically agree with every point you make about the Poisson vs the Gaussian, except for the following. The Iobs=Ispot-Iback equation cannot be derived from a Poisson assumption, except as an approximation when Ispot >> Iback. It *can* be derived from the Gaussian assumption (and in fact I think that is probably the *only* justification it has). It is true that the difference between two Poissons can be negative. It is also true that for moderate # of counts, the Gaussian is a good approximation to the Poisson. But we are trying to estimate Itrue, and both of those points are irrelevant to estimating Itrue when Ispot < Iback. Contrary to your assertion, we are not concerned with differences of Poissonians, only sums. Here is why: In the Poisson model you outline, Ispot is the sum of two Poisson variables, Iback and Iobs. That means Ispot is also Poisson and can never be negative. Again --- the observed data (Ispot) is a *sum*, so that is what we must deal with. The likelihood function for this model is: L(a) = (a+b)^k exp(-a-b) where 'k' is the # of counts in Ispot, 'a' is the mean of the Iobs Poisson (i.e., a = Itrue), and 'b' is the mean of the Iback Poisson. Of course k >= 0, and both parameters a > 0 and b > 0. Our job is to estimate 'a', Itrue. Given the likelihood function above, there is no valid estimate of 'a' that will give a negative value. For example, the ML estimate of 'a' is always non-negative. Specifically, if we assume 'b' is known from background extrapolation, the ML estimate of 'a' is: a = k-b if k > b, and a = 0 if k <= b. You can verify this visually by plotting the likelihood function (vs 'a' as variable) for any combination of k and b you want. The SD is a bit more difficult, but it is approximately (a+b)/sqrt(k), where 'a' is now the ML estimate of 'a'. Note that the ML estimate of 'a', when k > b (Ispot > Iback), is equivalent to Ispot-Iback. Now, to restate: as an estimate of Itrue, Ispot-Iback cannot be derived from the Poisson model. In contrast, Ispot-Iback *can* be derived from a Gaussian model (as the ML and LS estimate of Itrue). In fact, I'll wager the Gaussian is the only reasonable model that gives Ispot-Iback as an estimate of Itrue. This is why I claim that using Ispot-Iback as an estimate of Itrue, even when Ispot < Iback, implicitly means you are using a (non-physical) Gaussian model. Feel free to prove me wrong --- can you derive Ispot-Iback, as an estimate of Itrue, from anything besides a Gaussian? Cheers, Douglas On Sat, Jun 22, 2013 at 12:06 PM, Ian Tickle ianj...@gmail.com wrote: On 21 June 2013 19:45, Douglas Theobald dtheob...@brandeis.edu wrote: The current way of doing things is summarized by Ed's equation: Ispot-Iback=Iobs. Here Ispot is the # of counts in the spot (the area encompassing the predicted reflection), and Iback is # of counts in the background (usu. some area around the spot). 
Our job is to estimate the true intensity Itrue. Ed and others argue that Iobs is a reasonable estimate of Itrue, but I say it isn't because Itrue can never be negative, whereas Iobs can. Now where does the Ispot-Iback=Iobs equation come from? It implicitly assumes that both Iobs and Iback come from a Gaussian distribution, in which Iobs and Iback can have negative values. Here's the implicit data model: Ispot = Iobs + Iback There is an Itrue, to which we add some Gaussian noise and randomly generate an Iobs. To that is added some background noise, Iback, which is also randomly generated from a Gaussian with a true mean of Ibtrue. This gives us the Ispot, the measured intensity in our spot. Given this data model, Ispot will also have a Gaussian distribution, with mean equal to the sum of Itrue + Ibtrue. From the properties of Gaussians, then, the ML estimate of Itrue will be Ispot-Iback, or Iobs. Douglas, sorry I still disagree with your model. Please note that I do actually support your position, that Ispot-Iback is not the best estimate of Itrue. I stress that I am not arguing against this conclusion, merely (!) with your data model, i.e. you are arriving at the correct conclusion despite using the wrong model! So I think it's worth clearing that up. First off, I can assure you that there is no assumption, either implicit or explicit, that Ispot
Re: [ccp4bb] ctruncate bug?
On Sat, Jun 22, 2013 at 1:04 PM, Douglas Theobald dtheob...@brandeis.edu wrote: Feel free to prove me wrong --- can you derive Ispot-Iback, as an estimate of Itrue, from anything besides a Gaussian? OK, I'll prove myself wrong. Ispot-Iback can be derived as an estimate of Itrue, even when Ispot < Iback, assuming a logistic model, a Laplace model, and probably others that allow negative values. I doubt anyone cares about these more exotic models; the point is that to get Ispot-Iback as an estimate of Itrue when Ispot < Iback requires a non-physical model that allows negative photon counts and intensities.
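A quick numerical check (editorial, with invented numbers) of the point just made: under a symmetric error model that allows negative values, such as a Laplace distribution centred on Itrue + Iback, the unconstrained ML estimate of Itrue is simply Ispot - Iback, even when that difference is negative.

```python
from scipy.optimize import minimize_scalar

ispot, iback = 40.0, 55.0                      # spot counts below the (scaled) background

def neg_loglik(itrue, scale=5.0):
    # one observation ispot drawn from Laplace(centre = itrue + iback, scale)
    return abs(ispot - (itrue + iback)) / scale

fit = minimize_scalar(neg_loglik, bounds=(-200.0, 200.0), method="bounded")
print(round(fit.x, 2), ispot - iback)          # both about -15
```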
Re: [ccp4bb] ctruncate bug?
On Sat, Jun 22, 2013 at 1:56 PM, Ian Tickle ianj...@gmail.com wrote: On 22 June 2013 18:04, Douglas Theobald dtheob...@brandeis.edu wrote: --- but in truth the Poisson model does not account for other physical sources of error that arise from real crystals and real detectors, such as dark noise and read noise (that's why I would prefer a gamma distribution). A photon counter is a digital device, not an analogue one. It starts at zero and adds 1 every time it detects a photon (or what it thinks is a photon). Once added, it is physically impossible for it to subtract 1 from its accumulated count: it contains no circuit to do that. It can certainly miss photons, so you end up with less than you should, and it can certainly 'see' photons where there were none (e.g. from instrumental noise), so you end up with more than you should. However once a count has been accumulated in the digital memory it stays there until the memory is cleared for the next measurement, and you can never end up with less than that accumulated count and in particular not less than zero; the bits of memory where the counts are accumulated are simply not programmed to return negative numbers. It has nothing to do with whether the crystal is real or not, all that matters is that photons from somewhere are arriving at and being counted by the detector. The accumulated counts at any moment in time have a Poisson distribution since the photons arrive completely randomly in time. I might add that if you are correct --- that the naive Poisson model is appropriate (perhaps true for the latest and greatest detectors, evidently Pilatus has no read-out noise or dark current) --- then the ML solution I outlined is a good one (much better than the crude Ispot-Iback background subtraction), and it provides rigorous SD estimates too.
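For concreteness, a minimal sketch (editorial, not code from the thread) of the ML solution referred to here, using the expressions quoted earlier in the thread: with k counts in the spot and a known background mean b scaled to the spot area, the likelihood L(a) = (a+b)^k exp(-(a+b)) peaks at a = k - b for k > b and at a = 0 otherwise, with SD roughly (a+b)/sqrt(k).

```python
def ml_intensity(k, b):
    """ML estimate of the true intensity and its approximate SD under the Poisson model above."""
    a_hat = max(float(k) - b, 0.0)             # never negative
    sd = (a_hat + b) / k ** 0.5 if k > 0 else float("nan")
    return a_hat, sd

print(ml_intensity(55, 45))   # spot above background: estimate is k - b = 10
print(ml_intensity(35, 65))   # spot below background: estimate pinned at 0
```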
Re: [ccp4bb] ctruncate bug?
On Jun 22, 2013, at 6:18 PM, Frank von Delft frank.vonde...@sgc.ox.ac.uk wrote: A fascinating discussion (I've learnt a lot!); a quick sanity check, though: In what scenarios would these improved estimates make a significant difference? Who knows? I always think that improved estimates are always a good thing, ignoring computational complexity (by improved I mean making more accurate physical assumptions). This may all be academic --- estimating Itrue with unphysical negative values, and then later correcting w/French-Wilson, may give approximately the same answers and make no tangible difference in the models. But that all seems a bit convoluted, ad hoc, and unnecessary, esp. now with the available computational power. It might make a difference. Or rather: are there any existing programs (as opposed to vapourware) that would benefit significantly? Cheers phx On 22/06/2013 18:04, Douglas Theobald wrote: Ian, I really do think we are almost saying the same thing. Let me try to clarify. You say that the Gaussian model is not the correct data model, and that the Poisson is correct. I more-or-less agree. If I were being pedantic (me?) I would say that the Poisson is *more* physically realistic than the Gaussian, and more realistic in a very important and relevant way --- but in truth the Poisson model does not account for other physical sources of error that arise from real crystals and real detectors, such as dark noise and read noise (that's why I would prefer a gamma distribution). I also agree that for x > 10 the Gaussian is a good approximation to the Poisson. I basically agree with every point you make about the Poisson vs the Gaussian, except for the following. The Iobs=Ispot-Iback equation cannot be derived from a Poisson assumption, except as an approximation when Ispot >> Iback. It *can* be derived from the Gaussian assumption (and in fact I think that is probably the *only* justification it has). It is true that the difference between two Poissons can be negative. It is also true that for moderate # of counts, the Gaussian is a good approximation to the Poisson. But we are trying to estimate Itrue, and both of those points are irrelevant to estimating Itrue when Ispot < Iback. Contrary to your assertion, we are not concerned with differences of Poissonians, only sums. Here is why: In the Poisson model you outline, Ispot is the sum of two Poisson variables, Iback and Iobs. That means Ispot is also Poisson and can never be negative. Again --- the observed data (Ispot) is a *sum*, so that is what we must deal with. The likelihood function for this model is: L(a) = (a+b)^k exp(-a-b) where 'k' is the # of counts in Ispot, 'a' is the mean of the Iobs Poisson (i.e., a = Itrue), and 'b' is the mean of the Iback Poisson. Of course k >= 0, and both parameters a > 0 and b > 0. Our job is to estimate 'a', Itrue. Given the likelihood function above, there is no valid estimate of 'a' that will give a negative value. For example, the ML estimate of 'a' is always non-negative. Specifically, if we assume 'b' is known from background extrapolation, the ML estimate of 'a' is: a = k-b if k > b, and a = 0 if k <= b. You can verify this visually by plotting the likelihood function (vs 'a' as variable) for any combination of k and b you want. The SD is a bit more difficult, but it is approximately (a+b)/sqrt(k), where 'a' is now the ML estimate of 'a'. Note that the ML estimate of 'a', when k > b (Ispot > Iback), is equivalent to Ispot-Iback. Now, to restate: as an estimate of Itrue, Ispot-Iback cannot be derived from the Poisson model. 
In contrast, Ispot-Iback *can* be derived from a Gaussian model (as the ML and LS estimate of Itrue). In fact, I'll wager the Gaussian is the only reasonable model that gives Ispot-Iback as an estimate of Itrue. This is why I claim that using Ispot-Iback as an estimate of Itrue, even when Ispot < Iback, implicitly means you are using a (non-physical) Gaussian model. Feel free to prove me wrong --- can you derive Ispot-Iback, as an estimate of Itrue, from anything besides a Gaussian? Cheers, Douglas On Sat, Jun 22, 2013 at 12:06 PM, Ian Tickle ianj...@gmail.com wrote: On 21 June 2013 19:45, Douglas Theobald dtheob...@brandeis.edu wrote: The current way of doing things is summarized by Ed's equation: Ispot-Iback=Iobs. Here Ispot is the # of counts in the spot (the area encompassing the predicted reflection), and Iback is # of counts in the background (usu. some area around the spot). Our job is to estimate the true intensity Itrue. Ed and others argue that Iobs is a reasonable estimate of Itrue, but I say it isn't because Itrue can never be negative, whereas Iobs can. Now where does the Ispot-Iback=Iobs equation come from? It implicitly assumes that both Iobs and Iback come from a Gaussian
Re: [ccp4bb] ctruncate bug?
On Jun 21, 2013, at 8:36 AM, Ed Pozharski epozh...@umaryland.edu wrote: On 06/20/2013 01:07 PM, Douglas Theobald wrote: How can there be nothing wrong with something that is unphysical? Intensities cannot be negative. I think you are confusing two things - the true intensities and observed intensities. But I'm not. Let me try to convince you ... True intensities represent the number of photons that diffract off a crystal in a specific direction or, for QED-minded, relative probabilities of a single photon being found in a particular area of the detector when its probability wave function finally collapses. I agree. True intensities certainly cannot be negative and in crystallographic method they never are. They are represented by the best theoretical estimates possible, Icalc. These are always positive. I also very much agree. Observed intensities are the best estimates that we can come up with in an experiment. I also agree with this, and this is the clincher. You are arguing that Ispot-Iback=Iobs is the best estimate we can come up with. I claim that is absurd. How are you quantifying "best"? Usually we have some sort of discrepancy measure between true and estimate, like RMSD, mean absolute distance, log distance, or somesuch. Here is the important point --- by any measure of discrepancy you care to use, the person who estimates Iobs as 0 when Iback > Ispot will *always*, in *every case*, beat the person who estimates Iobs with a negative value. This is an indisputable fact. These are determined by integrating pixels around the spot where a particular reflection is expected to hit the detector. Unfortunately, science did not yet invent a method that would allow to suspend a crystal in vacuum while also removing all of the outside solvent. Neither we have included diffuse scatter in our theoretical model. Because of that, full reflection intensity contains background signal in addition to the Icalc. This background has to be subtracted and what is perhaps the most useful form of observation is Ispot-Iback=Iobs. How can that be the most useful form, when 0 is always a better estimate than a negative value, by any criterion? These observed intensities can be negative because while their true underlying value is positive, random errors may result in Iback > Ispot. There is absolutely nothing unphysical here. Yes there is. The only way you can get a negative estimate is to make unphysical assumptions. Namely, the estimate Ispot-Iback=Iobs assumes that both the true value of I and the background noise come from a Gaussian distribution that is allowed to have negative values. Both of those assumptions are unphysical. Replacing Iobs with E(J) is not only unnecessary, it's ill-advised as it will distort intensity statistics. For example, let's say you have translational NCS aligned with crystallographic axes, and hence some set of reflections is systematically absent. If all is well, Iobs~0 for the subset while E(J) is systematically positive. This obviously happens because the standard Wilson prior is wrong for these reflections, but I digress, as usual. In summary, there is indeed nothing wrong, imho, with negative Iobs. The fact that some of these may become negative is correctly accounted for once sigI is factored into the ML target. Cheers, Ed. -- Oh, suddenly throwing a giraffe into a volcano to make water is crazy? Julian, King of Lemurs
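A small Monte Carlo illustration (editorial, with invented numbers) of the discrepancy argument above: for a weak reflection, clamping the background-subtracted estimate at zero never gives a larger RMSD from the true intensity than keeping the negative value. It addresses only the point-estimate claim, not Ed's separate concern that clamping inflates averages over many reflections.

```python
import numpy as np

rng = np.random.default_rng(0)
itrue, ibtrue, n = 2.0, 50.0, 200_000        # true intensity, true background, number of trials
ispot = rng.poisson(itrue + ibtrue, n)       # counts in the spot
iback = rng.poisson(ibtrue, n)               # background counts scaled to the same area
iobs = ispot - iback                         # conventional "Iobs"; can be negative
iclamp = np.maximum(iobs, 0)                 # negative estimates replaced by 0

rmsd = lambda est: np.sqrt(np.mean((est - itrue) ** 2))
print(rmsd(iobs), rmsd(iclamp))              # the clamped estimator never does worse
```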
Re: [ccp4bb] ctruncate bug?
I kinda think we're saying the same thing, sort of. You don't like the Gaussian assumption, and neither do I. If you make the reasonable Poisson assumptions, then you don't get the Ispot-Iback=Iobs for the best estimate of Itrue. Except as an approximation for large values, but we are talking about the case when Iback > Ispot, where the Gaussian approximation to the Poisson no longer holds. The sum of two Poisson variates is also Poisson, which also can never be negative, unlike the Gaussian. So I reiterate: the Ispot-Iback=Iobs equation assumes Gaussians and hence negativity. The Ispot-Iback=Iobs equation does not follow from a Poisson assumption. On Jun 21, 2013, at 1:13 PM, Ian Tickle ianj...@gmail.com wrote: On 21 June 2013 17:10, Douglas Theobald dtheob...@brandeis.edu wrote: Yes there is. The only way you can get a negative estimate is to make unphysical assumptions. Namely, the estimate Ispot-Iback=Iobs assumes that both the true value of I and the background noise come from a Gaussian distribution that is allowed to have negative values. Both of those assumptions are unphysical. Actually that's not correct: Ispot and Iback are both assumed to come from a _Poisson_ distribution which by definition is zero for negative values of its argument (you can't have a negative number of photons), so are _not_ allowed to have negative values. For large values of the argument (in fact the approximation is pretty good even for x ~ 10) a Poisson approximates to a Gaussian, and then of course the difference Ispot-Iback is also approximately Gaussian. But I think that doesn't affect your argument. Cheers -- Ian
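A quick check (editorial) of the remark that a Poisson with mean around 10 is already close to a Gaussian of the same mean and variance; comparing continuity-corrected per-bin probabilities puts the largest discrepancy near one percent.

```python
import numpy as np
from scipy.stats import norm, poisson

lam = 10.0
k = np.arange(0, 31)
p_pois = poisson.pmf(k, lam)
# Gaussian probability assigned to the same integer bins, with continuity correction
p_gauss = norm.cdf(k + 0.5, lam, np.sqrt(lam)) - norm.cdf(k - 0.5, lam, np.sqrt(lam))
print(np.abs(p_pois - p_gauss).max())        # largest per-bin difference, about 0.01
```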
Re: [ccp4bb] ctruncate bug?
On Jun 20, 2013, at 2:13 PM, Ian Tickle ianj...@gmail.com wrote: Douglas, I think you are missing the point that estimation of the parameters of the proper Bayesian statistical model (i.e. the Wilson prior) in order to perform the integration in the manner you are suggesting requires knowledge of the already integrated intensities! Well, that's true, but that's how FW do it. They allow for negative integrated intensities. I'm arguing that we should not do that, since true intensities are positive, and so any estimate of them should also be positive. Examples are always better than words, so here goes (and I apologize for the length): The current way of doing things is summarized by Ed's equation: Ispot-Iback=Iobs. Here Ispot is the # of counts in the spot (the area encompassing the predicted reflection), and Iback is # of counts in the background (usu. some area around the spot). Our job is to estimate the true intensity Itrue. Ed and others argue that Iobs is a reasonable estimate of Itrue, but I say it isn't because Itrue can never be negative, whereas Iobs can. Now where does the Ispot-Iback=Iobs equation come from? It implicitly assumes that both Iobs and Iback come from a Gaussian distribution, in which Iobs and Iback can have negative values. Here's the implicit data model: Ispot = Iobs + Iback There is an Itrue, to which we add some Gaussian noise and randomly generate an Iobs. To that is added some background noise, Iback, which is also randomly generated from a Gaussian with a true mean of Ibtrue. This gives us the Ispot, the measured intensity in our spot. Given this data model, Ispot will also have a Gaussian distribution, with mean equal to the sum of Itrue + Ibtrue. From the properties of Gaussians, then, the ML estimate of Itrue will be Ispot-Iback, or Iobs. Now maybe you disagree with that Gaussian data model. If so, welcome to my POV. There are better models, ones that don't give Ispot-Iback as our best estimate of Itrue. Here is a simple example that incorporates our knowledge that Itrue cannot be negative (this example is primarily for illustrating the point, it's not exactly what I would recommend). Instead of using Gaussians, we will use Gamma distributions, which cannot be negative. We assume Iobs is distributed according to a Gamma(Itrue,1). The mean of this distribution is Itrue. (The Maxwell-Boltzmann energy distribution is also a gamma, just for comparison). We also assume that the noise is exponential (a special case of the gamma), Gamma(1,1). The mean of this distribution is 1. (You could imagine that you've normalized Ispot relative to its background --- again, just for ease of calculation). We still assume that Ispot = Iobs + Iback. Then, Ispot will also have a gamma distribution, Gamma(Itrue+1,1). The mean of the Ispot distribution, as you might expect, is Itrue+1. Now we measure Ispot. Given Ispot, the ML estimate of Itrue is: InvDiGamma[ln(Ispot)] - 1 if Ispot > 0.561, or 0 if Ispot <= 0.561. Note, the ML estimate is no longer Iobs, and the ML estimate cannot be negative. InvDiGamma is the inverse Digamma function --- a bit unusual, but easily calculated (actually no weirder than the exponential or logarithm, it's a relative of the factorial and the gamma function). Not something the Braggs would've used, but hey, we've got iPhones now. We can also estimate the SD of our estimate, but I won't bore you with the equation. 
A few examples:

Ispot   ML Itrue    SD
-----   --------   ----
  0.5      0       0.78
  0.6      0.04    0.80
  0.8      0.25    0.91
  0.9      0.36    0.97
  1.0      0.46    1.0
  1.5      0.97    1.2
  2.0      1.48    1.4
  3.0      2.49    1.7
  5.0      4.49    2.2
 10.0      9.50    3.2
 20.0     19.5     4.5
100       99.5    10

Note that the first four entries in the table are the case when Ispot < Iback. No negative estimates. You'd get qualitatively similar results if you assume Poisson for Iback and Ispot. To sum up --- the equation Ispot-Iback=Iobs is unphysical because it is founded on unphysical assumptions. If you make better physical assumptions (i.e., Itrue cannot be negative), you end up with different estimates for Itrue. I suppose we could iterate, i.e. assume an approximate prior, integrate, calculate a better prior, re-do the integration with the new prior and so on (hoping of course that the whole process converges), but I think most people would regard that as overkill. Also dealing with the issue of averaging estimates of intensities that no longer have a Gaussian error distribution, and also crucially outlier rejection, would require some rethinking of the algorithms. The question is would it make any difference in the end compared with the 'post-correction' we're doing now? Cheers -- Ian On 20 June 2013 18:14, Douglas Theobald dtheob...@brandeis.edu wrote: I still don't see how you get a negative
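An editorial sketch of how the "ML Itrue" column of the table above can be reproduced: under the illustrative model Ispot ~ Gamma(Itrue+1, 1), the ML estimate solves digamma(Itrue+1) = ln(Ispot) and is pinned at 0 for Ispot <= exp(digamma(1)), roughly 0.561. The inverse digamma is evaluated here by simple root finding; the function name is assumed.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def ml_itrue(ispot):
    if ispot <= np.exp(digamma(1.0)):        # below ~0.561 the maximum sits on the boundary
        return 0.0
    target = np.log(ispot)
    # numerical "InvDiGamma": solve digamma(x) = ln(ispot), then subtract 1
    return brentq(lambda x: digamma(x) - target, 1e-6, 1e6) - 1.0

for s in (0.5, 1.0, 2.0, 10.0, 100.0):
    print(s, round(ml_itrue(s), 2))          # 0, 0.46, 1.48, 9.5, 99.5, matching the table
```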
Re: [ccp4bb] ctruncate bug?
On Jun 21, 2013, at 2:48 PM, Ed Pozharski epozh...@umaryland.edu wrote: Douglas, Observed intensities are the best estimates that we can come up with in an experiment. I also agree with this, and this is the clincher. You are arguing that Ispot-Iback=Iobs is the best estimate we can come up with. I claim that is absurd. How are you quantifying "best"? Usually we have some sort of discrepancy measure between true and estimate, like RMSD, mean absolute distance, log distance, or somesuch. Here is the important point --- by any measure of discrepancy you care to use, the person who estimates Iobs as 0 when Iback > Ispot will *always*, in *every case*, beat the person who estimates Iobs with a negative value. This is an indisputable fact. First off, you may find it useful to avoid such words as "absurd" and "indisputable fact". I know political correctness may be sometimes overrated, but if you actually plan to have a meaningful discussion, let's assume that everyone responding to your posts is just trying to help figure this out. I apologize for offending and using the strong words --- my intention was not to offend. This is just how I talk when brainstorming with my colleagues around a blackboard, but of course then you can see that I smile when I say it. To address your point, you are right that J=0 is closer to true intensity than a negative value. The problem is that we are not after a single intensity, but rather all of them, as they all contribute to electron density reconstruction. If you replace negative Iobs with E(J), you would systematically inflate the averages, which may turn problematic in some cases. So, I get the point. But even then, using any reasonable criterion, the whole estimated dataset will be closer to the true data if you set all negative intensity estimates to 0. It is probably better to stick with raw intensities and construct theoretical predictions properly to account for their properties. What I was trying to tell you is that observed intensities are what we get from experiment. But they are not what you get from the detector. The detector spits out a positive value for what's inside the spot. It is we, as human agents, who later manipulate and massage that data value by subtracting the background estimate. A value that has been subjected to a crude background subtraction is not the raw experimental value. It has been modified, and there must be some logic to why we massage the data in that particular manner. I agree, of course, that the background should be accounted for somehow. But why just subtract it away? There are other ways to massage the data --- see my other post to Ian. My argument is that however we massage the experimentally observed value should be physically informed, and allowing negative intensity estimates violates the basic physics. [snip] These observed intensities can be negative because while their true underlying value is positive, random errors may result in Iback > Ispot. There is absolutely nothing unphysical here. Yes there is. The only way you can get a negative estimate is to make unphysical assumptions. Namely, the estimate Ispot-Iback=Iobs assumes that both the true value of I and the background noise come from a Gaussian distribution that is allowed to have negative values. Both of those assumptions are unphysical. See, I have a problem with this. Both common sense and laws of physics dictate that the number of photons hitting a spot on a detector is a positive number. 
There is no law of physics that dictates that under no circumstances there could be Ispot < Iback. That's not what I'm saying. Sure, Ispot can be less than Iback randomly. That does not mean we have to estimate the detected intensity as negative, after accounting for background. Yes, E(Ispot) >= E(Iback). Yes, E(Ispot-Iback) >= 0. But P(Ispot-Iback <= 0) > 0, and therefore experimental sampling of Ispot-Iback is bound to occasionally produce negative values. What law of physics is broken when for a given reflection the total number of photons in spot pixels is less than the total number of photons in an equal number of pixels in the surrounding background mask? Cheers, Ed. -- Oh, suddenly throwing a giraffe into a volcano to make water is crazy? Julian, King of Lemurs
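A brief numerical illustration (editorial, with invented numbers) of this point: the difference of two independent Poisson counts follows a Skellam distribution, and for a weak reflection over a strong background it comes out non-positive a large fraction of the time even though its expectation is non-negative.

```python
from scipy.stats import skellam

itrue, ibtrue = 2.0, 50.0                          # weak true signal, background per equal pixel area
print(skellam.cdf(0, itrue + ibtrue, ibtrue))      # P(Ispot - Iback <= 0), roughly 0.4
```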
Re: [ccp4bb] ctruncate bug?
On Jun 21, 2013, at 2:52 PM, James Holton jmhol...@lbl.gov wrote: Yes, but the DIFFERENCE between two Poisson-distributed values can be negative. This is, unfortunately, what you get when you subtract the background out from under a spot. Perhaps this is the source of confusion here? Maybe, but if you assume Poisson background and intensities, the ML estimate when background > measured intensity is not negative, nor is it the difference Ispot-Iback. The ML estimate is 0. (With a finite non-zero SD; the smaller the Ispot/Iback ratio, the smaller the SD.) On Fri, Jun 21, 2013 at 11:34 AM, Douglas Theobald dtheob...@brandeis.edu wrote: I kinda think we're saying the same thing, sort of. You don't like the Gaussian assumption, and neither do I. If you make the reasonable Poisson assumptions, then you don't get the Ispot-Iback=Iobs for the best estimate of Itrue. Except as an approximation for large values, but we are talking about the case when Iback > Ispot, where the Gaussian approximation to the Poisson no longer holds. The sum of two Poisson variates is also Poisson, which also can never be negative, unlike the Gaussian. So I reiterate: the Ispot-Iback=Iobs equation assumes Gaussians and hence negativity. The Ispot-Iback=Iobs does not follow from a Poisson assumption. On Jun 21, 2013, at 1:13 PM, Ian Tickle ianj...@gmail.com wrote: On 21 June 2013 17:10, Douglas Theobald dtheob...@brandeis.edu wrote: Yes there is. The only way you can get a negative estimate is to make unphysical assumptions. Namely, the estimate Ispot-Iback=Iobs assumes that both the true value of I and the background noise come from a Gaussian distribution that is allowed to have negative values. Both of those assumptions are unphysical. Actually that's not correct: Ispot and Iback are both assumed to come from a _Poisson_ distribution which by definition is zero for negative values of its argument (you can't have a negative number of photons), so are _not_ allowed to have negative values. For large values of the argument (in fact the approximation is pretty good even for x ~ 10) a Poisson approximates to a Gaussian, and then of course the difference Ispot-Iback is also approximately Gaussian. But I think that doesn't affect your argument. Cheers -- Ian
Re: [ccp4bb] ctruncate bug?
Just trying to understand the basic issues here. How could refining directly against intensities solve the fundamental problem of negative intensity values? On Jun 20, 2013, at 11:34 AM, Bernhard Rupp hofkristall...@gmail.com wrote: As a maybe better alternative, we should (once again) consider to refine against intensities (and I guess George Sheldrick would agree here). I have a simple question - what exactly, short of some sort of historic inertia (or memory lapse), is the reason NOT to refine against intensities? Best, BR
Re: [ccp4bb] ctruncate bug?
Seems to me that the negative Is should be dealt with early on, in the integration step. Why exactly do integration programs report negative Is to begin with? On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk wrote: Wouldn't it be possible to take advantage of negative Is to extrapolate/estimate the decay of scattering background (kind of Wilson plot of background scattering) to flat out the background and push all the Is to positive values? More of a question rather than a suggestion ... D From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Ian Tickle Sent: 20 June 2013 17:34 To: ccp4bb Subject: Re: [ccp4bb] ctruncate bug? Yes higher R factors is the usual reason people don't like I-based refinement! Anyway, refining against Is doesn't solve the problem, it only postpones it: you still need the Fs for maps! (though errors in Fs may be less critical then). -- Ian On 20 June 2013 17:20, Dale Tronrud det...@uoxray.uoregon.edu wrote: If you are refining against F's you have to find some way to avoid calculating the square root of a negative number. That is why people have historically rejected negative I's and why Truncate and cTruncate were invented. When refining against I, the calculation of (Iobs - Icalc)^2 couldn't care less if Iobs happens to be negative. As for why people still refine against F... When I was distributing a refinement package it could refine against I but no one wanted to do that. The R values ended up higher, but they were looking at R values calculated from F's. Of course the F based R values are lower when you refine against F's, that means nothing. If we could get the PDB to report both the F and I based R values for all models maybe we could get a start toward moving to intensity refinement. Dale Tronrud On 06/20/2013 09:06 AM, Douglas Theobald wrote: Just trying to understand the basic issues here. How could refining directly against intensities solve the fundamental problem of negative intensity values? On Jun 20, 2013, at 11:34 AM, Bernhard Rupp hofkristall...@gmail.com wrote: As a maybe better alternative, we should (once again) consider to refine against intensities (and I guess George Sheldrick would agree here). I have a simple question - what exactly, short of some sort of historic inertia (or memory lapse), is the reason NOT to refine against intensities? Best, BR
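An editorial illustration of Dale Tronrud's point in the quoted message: a least-squares residual on intensities is well defined for a negative Iobs, whereas converting to an amplitude first requires the square root of a negative number. The numbers are invented.

```python
import math

iobs, icalc = -3.0, 2.5                  # a weak reflection measured below background
print((iobs - icalc) ** 2)               # intensity-based residual: no problem
try:
    fobs = math.sqrt(iobs)               # amplitude-based refinement needs |Fobs| = sqrt(Iobs)
except ValueError as err:
    print("cannot form Fobs:", err)
```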
Re: [ccp4bb] ctruncate bug?
How can there be nothing wrong with something that is unphysical? Intensities cannot be negative. How could you measure a negative number of photons? You can only have a Gaussian distribution around I=0 if you are using an incorrect, unphysical statistical model. As I understand it, the physics predicts that intensities from diffraction should be gamma distributed (i.e., the square of a Gaussian variate), which makes sense as the gamma distribution assigns probability 0 to negative values. On Jun 20, 2013, at 1:00 PM, Bernard D Santarsiero b...@uic.edu wrote: There's absolutely nothing wrong with negative intensities. They are measurements of intensities that are near zero, and some will be negative, and others positive. The distribution around I=0 can still be Gaussian, and you have true esd's. With F's you used a derived esd since they can't be formally generated from the sigma's on I, and are very much undetermined for small intensities and small F's. Small molecule crystallographers routinely refine on F^2 and use all of the data, even if the F^2's are negative. Bernie On Jun 20, 2013, at 11:49 AM, Douglas Theobald wrote: Seems to me that the negative Is should be dealt with early on, in the integration step. Why exactly do integration programs report negative Is to begin with? On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk wrote: Wouldn't it be possible to take advantage of negative Is to extrapolate/estimate the decay of scattering background (kind of Wilson plot of background scattering) to flat out the background and push all the Is to positive values? More of a question rather than a suggestion ... D From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Ian Tickle Sent: 20 June 2013 17:34 To: ccp4bb Subject: Re: [ccp4bb] ctruncate bug? Yes higher R factors is the usual reason people don't like I-based refinement! Anyway, refining against Is doesn't solve the problem, it only postpones it: you still need the Fs for maps! (though errors in Fs may be less critical then). -- Ian On 20 June 2013 17:20, Dale Tronrud det...@uoxray.uoregon.edu wrote: If you are refining against F's you have to find some way to avoid calculating the square root of a negative number. That is why people have historically rejected negative I's and why Truncate and cTruncate were invented. When refining against I, the calculation of (Iobs - Icalc)^2 couldn't care less if Iobs happens to be negative. As for why people still refine against F... When I was distributing a refinement package it could refine against I but no one wanted to do that. The R values ended up higher, but they were looking at R values calculated from F's. Of course the F based R values are lower when you refine against F's, that means nothing. If we could get the PDB to report both the F and I based R values for all models maybe we could get a start toward moving to intensity refinement. Dale Tronrud On 06/20/2013 09:06 AM, Douglas Theobald wrote: Just trying to understand the basic issues here. How could refining directly against intensities solve the fundamental problem of negative intensity values? On Jun 20, 2013, at 11:34 AM, Bernhard Rupp hofkristall...@gmail.com wrote: As a maybe better alternative, we should (once again) consider to refine against intensities (and I guess George Sheldrick would agree here). 
I have a simple question - what exactly, short of some sort of historic inertia (or memory lapse), is the reason NOT to refine against intensities? Best, BR
Re: [ccp4bb] ctruncate bug?
I still don't see how you get a negative intensity from that. It seems you are saying that in many cases of a low intensity reflection, the integrated spot will be lower than the background. That is not equivalent to having a negative measurement (as the measurement is actually positive, and sometimes things are randomly less positive than background). If you are using a proper statistical model, after background correction you will end up with a positive (or 0) value for the integrated intensity. On Jun 20, 2013, at 1:08 PM, Andrew Leslie and...@mrc-lmb.cam.ac.uk wrote: The integration programs report a negative intensity simply because that is the observation. Because of noise in the Xray background, in a large sample of intensity estimates for reflections whose true intensity is very very small one will inevitably get some measurements that are negative. These must not be rejected because this will lead to bias (because some of these intensities for symmetry mates will be estimated too large rather than too small). It is not unusual for the intensity to remain negative even after averaging symmetry mates. Andrew On 20 Jun 2013, at 11:49, Douglas Theobald dtheob...@brandeis.edu wrote: Seems to me that the negative Is should be dealt with early on, in the integration step. Why exactly do integration programs report negative Is to begin with? On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk wrote: Wouldn't it be possible to take advantage of negative Is to extrapolate/estimate the decay of scattering background (kind of Wilson plot of background scattering) to flat out the background and push all the Is to positive values? More of a question rather than a suggestion ... D From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Ian Tickle Sent: 20 June 2013 17:34 To: ccp4bb Subject: Re: [ccp4bb] ctruncate bug? Yes higher R factors is the usual reason people don't like I-based refinement! Anyway, refining against Is doesn't solve the problem, it only postpones it: you still need the Fs for maps! (though errors in Fs may be less critical then). -- Ian On 20 June 2013 17:20, Dale Tronrud det...@uoxray.uoregon.edu wrote: If you are refining against F's you have to find some way to avoid calculating the square root of a negative number. That is why people have historically rejected negative I's and why Truncate and cTruncate were invented. When refining against I, the calculation of (Iobs - Icalc)^2 couldn't care less if Iobs happens to be negative. As for why people still refine against F... When I was distributing a refinement package it could refine against I but no one wanted to do that. The R values ended up higher, but they were looking at R values calculated from F's. Of course the F based R values are lower when you refine against F's, that means nothing. If we could get the PDB to report both the F and I based R values for all models maybe we could get a start toward moving to intensity refinement. Dale Tronrud On 06/20/2013 09:06 AM, Douglas Theobald wrote: Just trying to understand the basic issues here. How could refining directly against intensities solve the fundamental problem of negative intensity values? On Jun 20, 2013, at 11:34 AM, Bernhard Rupp hofkristall...@gmail.com wrote: As a maybe better alternative, we should (once again) consider to refine against intensities (and I guess George Sheldrick would agree here). 
I have a simple question - what exactly, short of some sort of historic inertia (or memory lapse), is the reason NOT to refine against intensities? Best, BR
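To make Andrew Leslie's point above concrete, here is a minimal numerical sketch (the counts are hypothetical; the only assumption is Poisson counting statistics for the spot box and for the background estimate taken from neighbouring pixels). Nearly half of the net intensity estimates for a very weak reflection come out negative, yet their average is unbiased --- which is exactly why they must not be rejected:

import numpy as np

rng = np.random.default_rng(0)

true_I = 2.0        # true (tiny) spot intensity, in counts -- hypothetical
background = 100.0  # true background under the spot, in counts -- hypothetical

n_obs = 100_000
# What the detector records in the spot box: spot photons plus background photons
spot_counts = rng.poisson(true_I + background, n_obs)
# Background estimated from neighbouring pixels (same expected level, independent noise)
bkg_estimate = rng.poisson(background, n_obs)

I_net = spot_counts - bkg_estimate   # the "observation" an integration program reports

print("fraction of negative net intensities:", np.mean(I_net < 0))  # ~0.44
print("mean of the net intensities:", I_net.mean())                 # ~2.0, i.e. unbiased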
Re: [ccp4bb] ctruncate bug?
On Jun 20, 2013, at 1:47 PM, Felix Frolow mbfro...@post.tau.ac.il wrote: Intensity is subtraction: Inet=Iobs - Ibackground. Iobs and Ibackground can not be negative. Inet CAN be negative if background is higher than Iobs. Just to reiterate, we know that the true value of Inet cannot be negative. Hence, the equation you quote is invalid and illogical --- it has no physical or statistical justification (except as an approximation for large Iobs and low Iback, when ironically background correction is unnecessary). That equation does not account for random statistical fluctuations (e.g., simple Poisson counting statistics of shot noise). We do not know how to model background scattering modulated my molecular transform and mechanical motion of the molecule, I recall we have called it TDS - thermal diffuse scattering. Many years ago Boaz Shaanan and JH were fascinated by it. If we would know how deal with TDS, we would go to much nicer structures some of us like and for sure to much lower R factors all of us love excluding maybe referees who will claim over refinement :-\ Dr Felix Frolow Professor of Structural Biology and Biotechnology, Department of Molecular Microbiology and Biotechnology Tel Aviv University 69978, Israel Acta Crystallographica F, co-editor e-mail: mbfro...@post.tau.ac.il Tel: ++972-3640-8723 Fax: ++972-3640-9407 Cellular: 0547 459 608 On Jun 20, 2013, at 20:07 , Douglas Theobald dtheob...@brandeis.edu wrote: How can there be nothing wrong with something that is unphysical? Intensities cannot be negative. How could you measure a negative number of photons? You can only have a Gaussian distribution around I=0 if you are using an incorrect, unphysical statistical model. As I understand it, the physics predicts that intensities from diffraction should be gamma distributed (i.e., the square of a Gaussian variate), which makes sense as the gamma distribution assigns probability 0 to negative values. On Jun 20, 2013, at 1:00 PM, Bernard D Santarsiero b...@uic.edu wrote: There's absolutely nothing wrong with negative intensities. They are measurements of intensities that are near zero, and some will be negative, and others positive. The distribution around I=0 can still be Gaussian, and you have true esd's. With F's you used a derived esd since they can't be formally generated from the sigma's on I, and are very much undetermined for small intensities and small F's. Small molecule crystallographers routinely refine on F^2 and use all of the data, even if the F^2's are negative. Bernie On Jun 20, 2013, at 11:49 AM, Douglas Theobald wrote: Seems to me that the negative Is should be dealt with early on, in the integration step. Why exactly do integration programs report negative Is to begin with? On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk wrote: Wouldnt be possible to take advantage of negative Is to extrapolate/estimate the decay of scattering background (kind of Wilson plot of background scattering) to flat out the background and push all the Is to positive values? More of a question rather than a suggestion ... D From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Ian Tickle Sent: 20 June 2013 17:34 To: ccp4bb Subject: Re: [ccp4bb] ctruncate bug? Yes higher R factors is the usual reason people don't like I-based refinement! Anyway, refining against Is doesn't solve the problem, it only postpones it: you still need the Fs for maps! (though errors in Fs may be less critical then). 
-- Ian On 20 June 2013 17:20, Dale Tronrud det...@uoxray.uoregon.edumailto:det...@uoxray.uoregon.edu wrote: If you are refining against F's you have to find some way to avoid calculating the square root of a negative number. That is why people have historically rejected negative I's and why Truncate and cTruncate were invented. When refining against I, the calculation of (Iobs - Icalc)^2 couldn't care less if Iobs happens to be negative. As for why people still refine against F... When I was distributing a refinement package it could refine against I but no one wanted to do that. The R values ended up higher, but they were looking at R values calculated from F's. Of course the F based R values are lower when you refine against F's, that means nothing. If we could get the PDB to report both the F and I based R values for all models maybe we could get a start toward moving to intensity refinement. Dale Tronrud On 06/20/2013 09:06 AM, Douglas Theobald wrote: Just trying to understand the basic issues here. How could refining directly against intensities solve the fundamental problem of negative intensity values? On Jun 20, 2013, at 11:34 AM, Bernhard Rupp hofkristall...@gmail.commailto:hofkristall...@gmail.com wrote: As a maybe better alternative, we should (once again
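Dale's point about the two residuals can be seen in a toy calculation (hypothetical numbers, plain least-squares residuals only --- not the actual maximum-likelihood targets used by modern refinement programs):

import math

iobs, sigma_i, icalc = -0.8, 1.5, 0.4   # hypothetical weak reflection with Iobs < 0

# Intensity-based residual: perfectly well defined for a negative Iobs
resid_I = ((iobs - icalc) / sigma_i) ** 2
print(resid_I)   # 0.64

# Amplitude-based residual needs Fobs = sqrt(Iobs), which does not exist for Iobs < 0,
# so the reflection must be rejected or massaged (e.g. French & Wilson) first.
try:
    fobs = math.sqrt(iobs)
except ValueError as err:
    print("cannot take the square root of a negative intensity:", err)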
Re: [ccp4bb] ctruncate bug?
Kay, I understand the French-Wilson way of currently doing things, as you outline below. My point is that it is not optimal --- we could do things better --- since even French-Wilson accepts the idea of negative intensity measurements. I am trying to disabuse the (very stubborn) view that when the background is more than the spot, the only possible estimate of the intensity is a negative value. This is untrue, and unjustified by the physics involved. In principle, there is no reason to use French-Wilson, as we should never have reported a negative integrated intensity to begin with. I also understand that (Iobs-Icalc)^2 is not the actual refinement target, but the same point applies, and the actual target is based on a fundamental Gaussian assumption for the Is. On Jun 20, 2013, at 2:13 PM, Kay Diederichs kay.diederi...@uni-konstanz.de wrote: Douglas, the intensity is negative if the integrated spot has a lower intensity than the estimate of the background under the spot. So yes, we are not _measuring_ negative intensities, rather we are estimating intensities, and that estimate may turn out to be negative. In a later step we try to correct for this, because it is non-physical, as you say. At that point, the proper statistical model comes into play. Essentially we use this as a prior. In the order of increasing information, we can have more or less informative priors for weak reflections: 1) I >= 0 2) I has a distribution looking like the right half of a Gaussian, and we estimate its width from the variance of the intensities in a resolution shell 3) I follows a Wilson distribution, and we estimate its parameters from the data in a resolution shell 4) I must be related to Fcalc^2 (i.e. once the structure is solved, we re-integrate using the Fcalc as prior) For a given experiment, the problem is chicken-and-egg in the sense that only if you know the characteristics of the data can you choose the correct prior. I guess that using prior 4) would be heavily frowned upon because there is a danger of model bias. You could say: A Bayesian analysis done properly should not suffer from model bias. This is probably true, but the theory to ensure the word properly is not available at the moment. Crystallographers usually use prior 3) which, as I tried to point out, also has its weak spots, namely if the data do not behave like those of an ideal crystal - and today's projects often result in data that would have been discarded ten years ago, so they are far from ideal. Prior 2) is available as an option in XDSCONV. Prior 1) seems to be used, or is available, in ctruncate in certain cases (I don't know the details). Using intensities instead of amplitudes in refinement would avoid having to choose a prior, and refinement would therefore not be compromised in case of data violating the assumptions underlying the prior. By the way, it is not (Iobs-Icalc)^2 that would be optimized in refinement against intensities, but rather the corresponding maximum likelihood formula (which I seem to remember is more complicated than the amplitude ML formula, or is not an analytical formula at all, but maybe somebody knows better). best, Kay On Thu, 20 Jun 2013 13:14:28 -0400, Douglas Theobald dtheob...@brandeis.edu wrote: I still don't see how you get a negative intensity from that. It seems you are saying that in many cases of a low intensity reflection, the integrated spot will be lower than the background.
That is not equivalent to having a negative measurement (as the measurement is actually positive, and sometimes things are randomly less positive than backgroiund). If you are using a proper statistical model, after background correction you will end up with a positive (or 0) value for the integrated intensity. On Jun 20, 2013, at 1:08 PM, Andrew Leslie and...@mrc-lmb.cam.ac.uk wrote: The integration programs report a negative intensity simply because that is the observation. Because of noise in the Xray background, in a large sample of intensity estimates for reflections whose true intensity is very very small one will inevitably get some measurements that are negative. These must not be rejected because this will lead to bias (because some of these intensities for symmetry mates will be estimated too large rather than too small). It is not unusual for the intensity to remain negative even after averaging symmetry mates. Andrew On 20 Jun 2013, at 11:49, Douglas Theobald dtheob...@brandeis.edu wrote: Seems to me that the negative Is should be dealt with early on, in the integration step. Why exactly do integration programs report negative Is to begin with? On Jun 20, 2013, at 12:45 PM, Dom Bellini dom.bell...@diamond.ac.uk wrote: Wouldnt be possible to take advantage of negative Is to extrapolate/estimate the decay of scattering background (kind of Wilson
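As an illustration of the simplest entry in Kay's list --- prior 1), I >= 0 --- here is a sketch of what a flat, positivity-truncated prior combined with a Gaussian likelihood does to a weak measurement. This is only the idea in miniature: French & Wilson use a Wilson prior (Kay's prior 3), not the flat prior shown here, and the numbers below are hypothetical.

from scipy.stats import norm

def posterior_mean_positive(i_obs, sigma):
    """Posterior mean of the true intensity given a Gaussian measurement
    (i_obs, sigma) and a flat prior restricted to I >= 0 (truncated normal)."""
    t = i_obs / sigma
    return i_obs + sigma * norm.pdf(t) / norm.cdf(t)

for i_obs in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(i_obs, round(posterior_mean_positive(i_obs, sigma=1.0), 3))
# e.g. a raw estimate of -1.0 (sigma 1.0) gives a posterior mean of about 0.53;
# the corrected estimate is always positive, and for strong data it tends to i_obs.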
Re: [ccp4bb] ctruncate bug?
Well, I tend to think Ian is probably right, that doing things the proper way (vs French-Wilson) will not make much of a difference in the end. Nevertheless, I don't think refining against the (possibly negative) intensities is a good solution to dealing with negative intensities --- that just ignores the problem, and will end up overweighting large negative intensities. Wouldn't it be better to correct the negative intensities with FW and then refine against that? On Jun 20, 2013, at 3:38 PM, Kay Diederichs kay.diederi...@uni-konstanz.de wrote: Douglas, as soon as you come up with an algorithm that gives accurate, unbiased intensity estimates together with their standard deviations, everybody will be happy. But I'm not aware of progress in this question (Poisson signal with background) in the last decades - I'd be glad to be proven wrong! Kay Am 20.06.13 21:27, schrieb Douglas Theobald: Kay, I understand the French-Wilson way of currently doing things, as you outline below. My point is that it is not optimal --- we could do things better --- since even French-Wilson accepts the idea of negative intensity measurements. I am trying to disabuse the (very stubborn) view that when the background is more than the spot, the only possible estimate of the intensity is a negative value. This is untrue, and unjustified by the physics involved. In principle, there is no reason to use French-Wilson, as we should never have reported a negative integrated intensity to begin with. I also understand that (Iobs-Icalc)^2 is not the actual refinement target, but the same point applies, and the actual target is based on a fundamental Gaussian assumption for the Is. On Jun 20, 2013, at 2:13 PM, Kay Diederichs kay.diederi...@uni-konstanz.de wrote: Douglas, the intensity is negative if the integrated spot has a lower intensity than the estimate of the background under the spot. So yes, we are not _measuring_ negative intensities, rather we are estimating intensities, and that estimate may turn out to be negative. In a later step we try to correct for this, because it is non-physical, as you say. At that point, the proper statistical model comes into play. Essentially we use this as a prior. In the order of increasing information, we can have more or less informative priors for weak reflections: 1) I 0 2) I has a distribution looking like the right half of a Gaussian, and we estimate its width from the variance of the intensities in a resolution shell 3) I follows a Wilson distribution, and we estimate its parameters from the data in a resolution shell 4) I must be related to Fcalc^2 (i.e. once the structure is solved, we re-integrate using the Fcalc as prior) For a given experiment, the problem is chicken-and-egg in the sense that only if you know the characteristics of the data can you choose the correct prior. I guess that using prior 4) would be heavily frowned upon because there is a danger of model bias. You could say: A Bayesian analysis done properly should not suffer from model bias. This is probably true, but the theory to ensure the word properly is not available at the moment. Crystallographers usually use prior 3) which, as I tried to point out, also has its weak spots, namely if the data do not behave like those of an ideal crystal - and today's projects often result in data that would have been discarded ten years ago, so they are far from ideal. 
Prior 2) is available as an option in XDSCONV Prior 1) seems to be used, or is available, in ctruncate in certain cases (I don't know the details) Using intensities instead of amplitudes in refinement would avoid having to choose a prior, and refinement would therefore not be compromised in case of data violating the assumptions underlying the prior. By the way, it is not (Iobs-Icalc)^2 that would be optimized in refinement against intensities, but rather the corresponding maximum likelihood formula (which I seem to remember is more complicated than the amplitude ML formula, or is not an analytical formula at all, but maybe somebody knows better). best, Kay On Thu, 20 Jun 2013 13:14:28 -0400, Douglas Theobald dtheob...@brandeis.edu wrote: I still don't see how you get a negative intensity from that. It seems you are saying that in many cases of a low intensity reflection, the integrated spot will be lower than the background. That is not equivalent to having a negative measurement (as the measurement is actually positive, and sometimes things are randomly less positive than backgroiund). If you are using a proper statistical model, after background correction you will end up with a positive (or 0) value for the integrated intensity. On Jun 20, 2013, at 1:08 PM, Andrew Leslie and...@mrc-lmb.cam.ac.uk wrote: The integration programs report a negative intensity simply because
Re: [ccp4bb] Strand distorsion and residue disconnectivity in pymol
To me, that's not a problem. The wavy representation is more accurate (as far as cartoon accuracy can go), as the strand actually follows the alpha carbons. This is why Pauling called it a pleated sheet --- it's got pleats. Beta sheets/strands *should* be wavy. On May 29, 2013, at 11:29 PM, wu donghui wdh0...@gmail.com wrote: Dear all, I found a problem when I use pymol to prepare structure interface. Strand is distorted when residue from the strand is connected to the strand by turning on side_chain_helper on. However when side_chain_helper is off, the strand turns to normal shape but the residue from it is disconnected to the strand. I attached the picture for your help. I know there must be some tricks for this. Welcome for any input. Thanks a lot. Best, Donghui Distorsion and connectivity in pymol for strand.pdf
Re: [ccp4bb] how to update phenix
On Mon, Feb 11, 2013 at 12:12 PM, Tim Gruene t...@shelx.uni-ac.gwdg.de wrote: Dear Bill, I disagree with your criticism. From http://www.ccp4.ac.uk/ccp4bb.php: CCP4bb is an electronic mailing list intended to host discussions about topics of general interest to macromolecular crystallographers.[...] Personally I am only subscribed to three mailing lists and I refrain from subscribing to more, which is one of the reasons why I welcome the liberal topic description of the ccp4bb. So the more appropriate analogy would be asking my mistress what to get my wife for V-day. Cheers, Tim On 02/10/2013 06:20 PM, William G. Scott wrote: On Feb 10, 2013, at 8:23 AM, LISA science...@gmail.com wrote: Hi all, My mac has the old version of phenix. How can I update to the new version? Should I delete the old version and download the new version to install as the first time? Thanks lisa You can delete it and download a new version, or simply keep both. phenix has version labels on their binaries, for the enjoyment of those who use shell auto-completion. e.g.: fennario-% phenix.refine external command phenix.refine phenix.refine_1.8.1-1168 BTW, there is also a phenix bb. Asking about this here is kind of like asking my wife what I should get my (purely hypothetical) mistress for valentine's day. -- Dr Tim Gruene Institut fuer anorganische Chemie Tammannstr. 4 D-37077 Goettingen GPG Key ID = A46BEE1A
Re: [ccp4bb] refining against weak data and Table I stats
On Dec 13, 2012, at 1:52 AM, James Holton jmhol...@lbl.gov wrote: [snip] So, what I would advise is to refine your model with data out to the resolution limit defined by CC*, but declare the resolution of the structure to be where the merged I/sigma(I) falls to 2. You might even want to calculate your Rmerge, Rcryst, Rfree and all the other R values to this resolution as well, since including a lot of zeroes does nothing but artificially drive up estimates of relative error. So James --- it appears that you basically agree with my proposal? I.e., (1) include all of the data in refinement (at least up to where CC1/2 or CC* is still significant) (2) keep the definition of resolution to what is more-or-less the defacto standard (res bin where I/sigI=2), (3) report Table I where everything is calculated up to this resolution (where I/sigI=2), and (4) maybe include in Supp Mat an additional table that reports statistics for all the data (I'm leaning towards a table with stats for each res bin) As you argued, and as I argued, this seems to be a good compromise, one that modifies current practice to include weak data, but nevertheless does not change the def of resolution or the Table I stats, so that we can still compare with legacy structures/stats. Perhaps we should even take a lesson from our small molecule friends and start reporting R1, where the R factor is computed only for hkls where I/sigma(I) is above 3? -James Holton MAD Scientist On 12/8/2012 4:04 AM, Miller, Mitchell D. wrote: I too like the idea of reporting the table 1 stats vs resolution rather than just the overall values and highest resolution shell. I also wanted to point out an earlier thread from April about the limitations of the PDB's defining the resolution as being that of the highest resolution reflection (even if data is incomplete or weak). https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1204L=ccp4bbD=01=ccp4bb9=AI=-3J=ond=No+Match%3BMatch%3BMatchesz=4P=376289 https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1204L=ccp4bbD=01=ccp4bb9=AI=-3J=ond=No+Match%3BMatch%3BMatchesz=4P=377673 What we have done in the past for cases of low completeness in the outer shell is to define the nominal resolution ala Bart Hazes' method of same number of reflections as a complete data set and use this in the PDB title and describe it in the remark 3 other refinement remarks. There is also the possibility of adding a comment to the PDB remark 2 which we have not used. http://www.wwpdb.org/documentation/format33/remarks1.html#REMARK%202 This should help convince reviewers that you are not trying to mis-represent the resolution of the structure. Regards, Mitch -Original Message- From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Edward A. Berry Sent: Friday, December 07, 2012 8:43 AM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] refining against weak data and Table I stats Yes, well, actually i'm only a middle author on that paper for a good reason, but I did encourage Rebecca and Stephan to use all the data. But on a later, much more modest submission, where the outer shell was not only weak but very incomplete (edges of the detector), the reviewers found it difficult to evaluate the quality of the data (we had also excluded a zone with bad ice-ring problems). So we provided a second table, cutting off above the ice ring in the good strong data, which convinced them that at least it is a decent 2A structure. In the PDB it is a 1.6A structure. but there was a lot of good data between the ice ring and 1.6 A. 
Bart Hazes (I think) suggested a statistic called effective resolution, which is the resolution at which a complete dataset would have the same number of reflections as your dataset, and we reported this, which came out to something like 1.75. I do like the idea of reporting in multiple shells, not just overall and highest shell, and the PDB accommodates this, even has a GUI to enter it in the ADIT 2.0 software. It could also be used to report two different overall ranges, such as completeness, 25 to 1.6 A, which would be shocking in my case, and 25 to 2.0 which would be more reassuring. eab Douglas Theobald wrote: Hi Ed, Thanks for the comments. So what do you recommend? Refine against weak data, and report all stats in a single Table I? Looking at your latest V-ATPase structure paper, it appears you favor something like that, since you report a high res shell with I/sigI=1.34 and Rsym=1.65. On Dec 6, 2012, at 7:24 PM, Edward A. Berry ber...@upstate.edu wrote: Another consideration here is your PDB deposition. If the reason for using weak data is to get a better structure, presumably you are going to deposit the structure using all the data. Then the statistics in the PDB file must reflect the high resolution refinement. There are I think three places in the PDB file where the resolution
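For what it's worth, the Bart Hazes "effective resolution" mentioned above reduces to a one-liner if one assumes the number of unique reflections grows as 1/d^3. The 77% completeness below is a hypothetical figure, chosen only to show how a nominally 1.6 A dataset can come out at roughly the 1.75 A Ed quotes:

def effective_resolution(d_min, completeness):
    """Resolution at which a 100% complete dataset would contain the same number
    of unique reflections as this one (completeness as a fraction; assumes the
    reflection count is proportional to 1/d**3)."""
    return d_min / completeness ** (1.0 / 3.0)

# e.g. a nominally 1.6 A dataset that is only ~77% complete overall
print(round(effective_resolution(1.6, 0.77), 2))   # -> 1.75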
Re: [ccp4bb] refining against weak data and Table I stats
Hi Ed, Thanks for the comments. So what do you recommend? Refine against weak data, and report all stats in a single Table I? Looking at your latest V-ATPase structure paper, it appears you favor something like that, since you report a high res shell with I/sigI=1.34 and Rsym=1.65. On Dec 6, 2012, at 7:24 PM, Edward A. Berry ber...@upstate.edu wrote: Another consideration here is your PDB deposition. If the reason for using weak data is to get a better structure, presumably you are going to deposit the structure using all the data. Then the statistics in the PDB file must reflect the high resolution refinement. There are I think three places in the PDB file where the resolution is stated, but i believe they are all required to be the same and to be equal to the highest resolution data used (even if there were only two reflections in that shell). Rmerge or Rsymm must be reported, and until recently I think they were not allowed to exceed 1.00 (100% error?). What are your reviewers going to think if the title of your paper is structure of protein A at 2.1 A resolution but they check the PDB file and the resolution was really 1.9 A? And Rsymm in the PDB is 0.99 but in your table 1* says 1.3? Douglas Theobald wrote: Hello all, I've followed with interest the discussions here about how we should be refining against weak data, e.g. data with I/sigI 2 (perhaps using all bins that have a significant CC1/2 per Karplus and Diederichs 2012). This all makes statistical sense to me, but now I am wondering how I should report data and model stats in Table I. Here's what I've come up with: report two Table I's. For comparability to legacy structure stats, report a classic Table I, where I call the resolution whatever bin I/sigI=2. Use that as my high res bin, with high res bin stats reported in parentheses after global stats. Then have another Table (maybe Table I* in supplementary material?) where I report stats for the whole dataset, including the weak data I used in refinement. In both tables report CC1/2 and Rmeas. This way, I don't redefine the (mostly) conventional usage of resolution, my Table I can be compared to precedent, I report stats for all the data and for the model against all data, and I take advantage of the information in the weak data during refinement. Thoughts? Douglas ^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^` Douglas L. Theobald Assistant Professor Department of Biochemistry Brandeis University Waltham, MA 02454-9110 dtheob...@brandeis.edu http://theobald.brandeis.edu/ ^\ /` /^. / /\ / / /`/ / . /` / / ' ' '
Re: [ccp4bb] refining against weak data and Table I stats
Hi Boaz, I read the KK paper as primarily a justification for including extremely weak data in refinement (and of course introducing a new single statistic that can judge data *and* model quality comparably). Using CC1/2 to gauge resolution seems like a good option, but I never got from the paper exactly how to do that. The resolution bin where CC1/2=0.5 seems natural, but in my (limited) experience that gives almost the same answer as I/sigI=2 (see also KK fig 3). On Dec 7, 2012, at 6:21 AM, Boaz Shaanan bshaa...@exchange.bgu.ac.il wrote: Hi, I'm sure Kay will have something to say about this but I think the idea of the K K paper was to introduce new (more objective) standards for deciding on the resolution, so I don't see why another table is needed. Cheers, Boaz Boaz Shaanan, Ph.D. Dept. of Life Sciences Ben-Gurion University of the Negev Beer-Sheva 84105 Israel E-mail: bshaa...@bgu.ac.il Phone: 972-8-647-2220 Skype: boaz.shaanan Fax: 972-8-647-2992 or 972-8-646-1710 From: CCP4 bulletin board [CCP4BB@JISCMAIL.AC.UK] on behalf of Douglas Theobald [dtheob...@brandeis.edu] Sent: Friday, December 07, 2012 1:05 AM To: CCP4BB@JISCMAIL.AC.UK Subject: [ccp4bb] refining against weak data and Table I stats Hello all, I've followed with interest the discussions here about how we should be refining against weak data, e.g. data with I/sigI 2 (perhaps using all bins that have a significant CC1/2 per Karplus and Diederichs 2012). This all makes statistical sense to me, but now I am wondering how I should report data and model stats in Table I. Here's what I've come up with: report two Table I's. For comparability to legacy structure stats, report a classic Table I, where I call the resolution whatever bin I/sigI=2. Use that as my high res bin, with high res bin stats reported in parentheses after global stats. Then have another Table (maybe Table I* in supplementary material?) where I report stats for the whole dataset, including the weak data I used in refinement. In both tables report CC1/2 and Rmeas. This way, I don't redefine the (mostly) conventional usage of resolution, my Table I can be compared to precedent, I report stats for all the data and for the model against all data, and I take advantage of the information in the weak data during refinement. Thoughts? Douglas ^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^` Douglas L. Theobald Assistant Professor Department of Biochemistry Brandeis University Waltham, MA 02454-9110 dtheob...@brandeis.edu http://theobald.brandeis.edu/ ^\ /` /^. / /\ / / /`/ / . /` / / ' ' '
Re: [ccp4bb] refining against weak data and Table I stats
A good way to think about it is that if CC1/2=100%, that means you can split the data in half, and use one half to perfectly predict the corresponding values of the other half. So yes, perfect internal consistency. On Dec 7, 2012, at 11:41 AM, Phil Evans p...@mrc-lmb.cam.ac.uk wrote: It is internally consistent, though not necessarily correct On 7 Dec 2012, at 16:23, Alan Cheung wrote: Related to this, I've always wondered what CC1/2 values mean for low resolution. Not being mathematically inclined, I'm sure this is a naive question, but i'll ask anyway - what does CC1/2=100 (or 99.9) mean? Does it mean the data is as good as it gets? Alan On 07/12/2012 17:15, Douglas Theobald wrote: Hi Boaz, I read the KK paper as primarily a justification for including extremely weak data in refinement (and of course introducing a new single statistic that can judge data *and* model quality comparably). Using CC1/2 to gauge resolution seems like a good option, but I never got from the paper exactly how to do that. The resolution bin where CC1/2=0.5 seems natural, but in my (limited) experience that gives almost the same answer as I/sigI=2 (see also KK fig 3). On Dec 7, 2012, at 6:21 AM, Boaz Shaanan bshaa...@exchange.bgu.ac.il wrote: Hi, I'm sure Kay will have something to say about this but I think the idea of the K K paper was to introduce new (more objective) standards for deciding on the resolution, so I don't see why another table is needed. Cheers, Boaz Boaz Shaanan, Ph.D. Dept. of Life Sciences Ben-Gurion University of the Negev Beer-Sheva 84105 Israel E-mail: bshaa...@bgu.ac.il Phone: 972-8-647-2220 Skype: boaz.shaanan Fax: 972-8-647-2992 or 972-8-646-1710 From: CCP4 bulletin board [CCP4BB@JISCMAIL.AC.UK] on behalf of Douglas Theobald [dtheob...@brandeis.edu] Sent: Friday, December 07, 2012 1:05 AM To: CCP4BB@JISCMAIL.AC.UK Subject: [ccp4bb] refining against weak data and Table I stats Hello all, I've followed with interest the discussions here about how we should be refining against weak data, e.g. data with I/sigI 2 (perhaps using all bins that have a significant CC1/2 per Karplus and Diederichs 2012). This all makes statistical sense to me, but now I am wondering how I should report data and model stats in Table I. Here's what I've come up with: report two Table I's. For comparability to legacy structure stats, report a classic Table I, where I call the resolution whatever bin I/sigI=2. Use that as my high res bin, with high res bin stats reported in parentheses after global stats. Then have another Table (maybe Table I* in supplementary material?) where I report stats for the whole dataset, including the weak data I used in refinement. In both tables report CC1/2 and Rmeas. This way, I don't redefine the (mostly) conventional usage of resolution, my Table I can be compared to precedent, I report stats for all the data and for the model against all data, and I take advantage of the information in the weak data during refinement. Thoughts? Douglas ^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^` Douglas L. Theobald Assistant Professor Department of Biochemistry Brandeis University Waltham, MA 02454-9110 dtheob...@brandeis.edu http://theobald.brandeis.edu/ ^\ /` /^. / /\ / / /`/ / . /` / / ' ' ' -- Alan Cheung Gene Center Ludwig-Maximilians-University Feodor-Lynen-Str. 25 81377 Munich Germany Phone: +49-89-2180-76845 Fax: +49-89-2180-76999 E-mail: che...@lmb.uni-muenchen.de
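For readers wondering how CC1/2 is computed in practice, a bare-bones sketch follows (assuming unmerged observations grouped by unique hkl; real programs typically do this per resolution shell, which is omitted here, and use a fixed random split). CC* = sqrt(2*CC1/2/(1 + CC1/2)) is the Karplus & Diederichs (2012) estimate of the correlation between the merged data and the underlying true signal:

import numpy as np

def cc_half(obs_by_hkl, rng=np.random.default_rng(0)):
    """obs_by_hkl: list of arrays, each holding the unmerged observations of one hkl.
    Split each reflection's observations into two random halves, average each half,
    and correlate the two sets of half-dataset means."""
    half1, half2 = [], []
    for obs in obs_by_hkl:
        if len(obs) < 2:
            continue                      # need at least one observation per half
        obs = rng.permutation(obs)
        half1.append(obs[: len(obs) // 2].mean())
        half2.append(obs[len(obs) // 2 :].mean())
    return np.corrcoef(half1, half2)[0, 1]

def cc_star(cchalf):
    return np.sqrt(2.0 * cchalf / (1.0 + cchalf))

# Toy usage with simulated data: Wilson-like (exponential) true intensities,
# four noisy observations per reflection
rng = np.random.default_rng(1)
true_I = rng.exponential(scale=10.0, size=500)
obs = [rng.normal(t, 5.0, size=4) for t in true_I]
cchalf = cc_half(obs)
print(round(cchalf, 3), round(cc_star(cchalf), 3))   # roughly 0.89 and 0.97 here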
[ccp4bb] refining against weak data and Table I stats
Hello all, I've followed with interest the discussions here about how we should be refining against weak data, e.g. data with I/sigI < 2 (perhaps using all bins that have a significant CC1/2 per Karplus and Diederichs 2012). This all makes statistical sense to me, but now I am wondering how I should report data and model stats in Table I. Here's what I've come up with: report two Table I's. For comparability to legacy structure stats, report a classic Table I, where I call the resolution whatever bin I/sigI=2. Use that as my high res bin, with high res bin stats reported in parentheses after global stats. Then have another Table (maybe Table I* in supplementary material?) where I report stats for the whole dataset, including the weak data I used in refinement. In both tables report CC1/2 and Rmeas. This way, I don't redefine the (mostly) conventional usage of resolution, my Table I can be compared to precedent, I report stats for all the data and for the model against all data, and I take advantage of the information in the weak data during refinement. Thoughts? Douglas ^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^` Douglas L. Theobald Assistant Professor Department of Biochemistry Brandeis University Waltham, MA 02454-9110 dtheob...@brandeis.edu http://theobald.brandeis.edu/ ^\ /` /^. / /\ / / /`/ / . /` / / ' ' '
Re: [ccp4bb] vitrification vs freezing
On Nov 16, 2012, at 10:27 AM, Enrico Stura est...@cea.fr wrote: As a referee I also dislike the word freezing but only if improperly used: The crystals were frozen in LN2 is not acceptable because it is the outside liquor that is rapidly cooled to cryogenic temperatures. right, while the crystals within the liquor remain at room temperature :) But the use of freezing used as the opposite of melting is fine and does not imply a crystalline state. Ice is not always crystalline either: http://en.wikipedia.org/wiki/Amorphous_ice -- Enrico A. Stura D.Phil. (Oxon) ,Tel: 33 (0)1 69 08 4302 Office Room 19, Bat.152, Tel: 33 (0)1 69 08 9449Lab LTMB, SIMOPRO, IBiTec-S, CE Saclay, 91191 Gif-sur-Yvette, FRANCE http://www-dsv.cea.fr/en/institutes/institute-of-biology-and-technology-saclay-ibitec-s/unites-de-recherche/department-of-molecular-engineering-of-proteins-simopro/molecular-toxinology-and-biotechnology-laboratory-ltmb/crystallogenesis-e.-stura http://www.chem.gla.ac.uk/protein/mirror/stura/index2.html e-mail: est...@cea.fr Fax: 33 (0)1 69 08 90 71
Re: [ccp4bb] vector and scalars
On Oct 16, 2010, at 3:32 PM, Ian Tickle wrote: Hi Tim As I indicated previously, the Fortran code was only meant to define my statement of the problem so that there can be absolutely no ambiguity as to the question: the answer to the problem (if it exists) has nothing whatsoever to do with the programming language used and I don't see how it can be constrained in any way by its semantics, since I also provided the questions in algebraic form. You can't get more 'natural' than that! The answer may be provided either algebraically (which would actually be preferable) or in any programming language of your choice: I am certainly not forcing anyone to code in Fortran if they don't want to. If you're saying that I'm unable to solve the problem just because I'm programming in Fortran, then you don't understand how algorithmic problem solving works: first an working solution must be obtained algebraically, then algorithmically, and only then programmatically. The first two steps are always the hardest, the last is almost always relatively trivial, and the programming language chosen is a matter of personal preference. It cannot constrain the solution, since that must already have been completely defined by the first two steps. I have not yet come across a purely algebraic problem which possesses semantics that couldn't be expressed in Fortran. That doesn't mean there aren't any, it's just that none of the problems that I've yet come across absolutely require programmng in another language: until they do I'm happy to stick with Fortran. Just to be clear again, the statement of the problem, expressed entirely algebraically, is: 1) To express F.G using vector notation only, where F and G are complex vectors of arbitrary dimension, and 2) Same with F1/G1 where F1 and G1 are complex numbers (e.g. individual elements of the above complex vectors). Ian -- Fortran itself actually treats complex numbers internally as vectors, so clearly there is a solution to your problem. In any case, you can easily program, in any language you want, F.G or F1/G1 using vector arithmetic. You cannot, however, confine yourself to the common standard dot and cross product. But, contrary to what you are apparently implying, those are not the only two possible vector multiplication operations that can be formally defined for vectors. As a simple counterexample, you can do element-wise vector multiplication and division. There is also the well-known geometric product (from Clifford algebra), the vector perp dot product, the vector direct product, and the wedge (exterior) product. The geometric product is esp. relevant here, because in 2D it is the same operation as multiplying two complex numbers (see http://en.wikipedia.org/wiki/Geometric_algebra#Complex_numbers ). These pages may be helpful for other examples: http://www.euclideanspace.com/maths/algebra/vectors/vecAlgebra/powers/index.htm http://www.euclideanspace.com/maths/algebra/vectors/vecAlgebra/exponent/index.htm As I said earlier, if an entity fulfills the axioms of a vector space, then they are vectors. http://en.wikipedia.org/wiki/Vector_space#Definition Complex numbers fulfill these axioms. On the other hand, there is no requirement for vectors to have valid dot and cross products defined. Euclidean vectors do, but that does not mean they are not vectors. Complex numbers have other operations defined for them, but again that does not mean that we cannot consider them as vectors in two dimensions. 
In fact, it is common in mathematics to consider complex numbers as 2x2 *matrices*, in which the matrix corresponding to i is an orthogonal 90 degree rotation matrix. Cheers, Douglas You see, absolutely no Fortran! Cheers -- Ian On Sat, Oct 16, 2010 at 8:50 AM, Tim Gruene t...@shelx.uni-ac.gwdg.de wrote: Dear Ian, maybe you should switch from Fortran to C++. Then you would not be forced to make nature follow the semantics of your programming language but can adjust your code to the problem you are tackling. The question you post would nicely fit into a first year's course on C++ (and of course can all be answered very elegantly). Cheers, Tim On Fri, Oct 15, 2010 at 11:55:54PM +0100, Ian Tickle wrote: On Fri, Oct 15, 2010 at 8:11 PM, Douglas Theobald dtheob...@brandeis.edu wrote: Vectors are not only three-dimensional, nor only Euclidean -- vectors can be defined for any number of arbitrary dimensions. Your initial comment referred to complex numbers, for instance, which are 2D vectors (not 1-D). Obviously scalars are not 3-vectors, they are 1-vectors. And contrary to your earlier assertion, you can always represent complex numbers as vectors (in fortran, C, on paper, or whatever), and it is possible to define many different valid types of multiplication, exponentiation, logarithms, powers, etc. for vectors (and matrices as well). I didn't say that vectors are only 3D or only
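A quick numerical check of that last statement --- a sketch in which the complex number a + bi is represented by the real 2x2 matrix [[a, -b], [b, a]], so that the matrix for i is a 90-degree rotation and matrix multiplication reproduces complex multiplication:

import numpy as np

def as_matrix(z):
    """Represent the complex number z = a + bi as the real 2x2 matrix [[a, -b], [b, a]]."""
    z = complex(z)
    return np.array([[z.real, -z.imag],
                     [z.imag,  z.real]])

i = as_matrix(1j)                          # [[0, -1], [1, 0]]: a 90-degree rotation matrix
print(np.allclose(i @ i, as_matrix(-1)))   # i*i == -1  ->  True

z1, z2 = 2 + 3j, -1 + 0.5j
# The matrix product of the representations equals the representation of the complex product
print(np.allclose(as_matrix(z1) @ as_matrix(z2), as_matrix(z1 * z2)))   # True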
Re: [ccp4bb] vector and scalars
As usual, the Omniscient Wikipedia does a pretty good job of giving the standard mathematical definition of a vector: http://en.wikipedia.org/wiki/Vector_space#Definition If the thing fulfills the axioms, it's a vector. Complex numbers do, as well as scalars. On Oct 15, 2010, at 8:56 AM, David Schuller wrote: On 10/14/10 11:22, Ed Pozharski wrote: Again, definitions are a matter of choice There is no correct definition of anything. Definitions are a matter of community choice, not personal choice; i.e. a matter of convention. If you come across a short squat animal with split hooves rooting through the mud and choose to define it as a giraffe, you will find yourself ignored and cut off from the larger community which chooses to define it as a pig. -- === All Things Serve the Beam === David J. Schuller modern man in a post-modern world MacCHESS, Cornell University schul...@cornell.edu ^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^` Douglas L. Theobald Assistant Professor Department of Biochemistry Mailstop 009 415 South St Brandeis University Waltham, MA 02454-9110 dtheob...@brandeis.edu http://theobald.brandeis.edu/ Office: +1 (781) 736-2303 Fax:+1 (781) 736-2349 ^\ /` /^. / /\ / / /`/ / . /` / / ' ' '
Re: [ccp4bb] vector and scalars
On Oct 15, 2010, at 11:37 AM, Ganesh Natrajan wrote: Douglas, The elements of a 'vector space' are not 'vectors' in the physical sense. And there you make Ed's point -- some people are using the general vector definition, others are using the more restricted Euclidean definition. The elements of a general vector space certainly can be physical, by any normal sense of the term. And note that physical 3D space is not Euclidean, in any case. The correct Wikipedia page is this one http://en.wikipedia.org/wiki/Euclidean_vector Ganesh On Fri, 15 Oct 2010 11:20:04 -0400, Douglas Theobald dtheob...@brandeis.edu wrote: As usual, the Omniscient Wikipedia does a pretty good job of giving the standard mathematical definition of a vector: http://en.wikipedia.org/wiki/Vector_space#Definition If the thing fulfills the axioms, it's a vector. Complex numbers do, as well as scalars. On Oct 15, 2010, at 8:56 AM, David Schuller wrote: On 10/14/10 11:22, Ed Pozharski wrote: Again, definitions are a matter of choice There is no correct definition of anything. Definitions are a matter of community choice, not personal choice; i.e. a matter of convention. If you come across a short squat animal with split hooves rooting through the mud and choose to define it as a giraffe, you will find yourself ignored and cut off from the larger community which chooses to define it as a pig. -- === All Things Serve the Beam === David J. Schuller modern man in a post-modern world MacCHESS, Cornell University schul...@cornell.edu
Re: [ccp4bb] vector and scalars
On Oct 15, 2010, at 12:14 PM, William G. Scott wrote: As usual, the Omniscient Wikipedia does a pretty good job of giving the standard mathematical definition of a vector: http://en.wikipedia.org/wiki/Vector_space#Definition If the thing fulfills the axioms, it's a vector. Complex numbers do, as well as scalars. It is a bit more complicated, unfortunately. cf: Don't you mean, it's a bit more _complex_? :) http://en.wikipedia.org/wiki/Complex_number#The_complex_plane http://en.wikipedia.org/wiki/Complex_number#Real_vector_space
Re: [ccp4bb] off topic: multiple structural sequence alignment
Both MUSTANG and MATT are good choices: http://www.cs.mu.oz.au/~arun/mustang/ http://groups.csail.mit.edu/cb/matt/ On Jan 12, 2010, at 7:17 AM, Ronnie Berntsson wrote: Dear all, A bit off the topic question perhaps. I am trying to find a program which can do multiple structural sequence alignments. What I would like is a program which can take as input PDB codes (or files), and which will output a multiple sequence alignment in FASTA format with the full sequences of the supplied proteins intact. Preferably said server/program should be able to handle at least 20 input pdbs at once. I've been looking around, but have so far failed to find a program which does this. If anyone knows of a program or server which could handle this, I would be very grateful. Cheers, Ronnie Berntsson
Re: [ccp4bb] FW: pdb-l: Retraction of 12 Structures
On Dec 16, 2009, at 7:40 AM, Anastassis Perrakis wrote: How very correct. And if anyone is in doubt, remember the fiasco of the 'memory of water', published in Nature. To borrow the title of DVD's talks, Just because it's in Nature, it does not mean it's true. Or, as one of my colleagues is known to say: It's in Nature, and it's even right! More specifically, we are seeing peer review at work. I think that the implementation of peer review as part of the publication process has perhaps led us to forget that peer review is a longer-term, ongoing process, conducted by the whole scientific community. A.
Re: [ccp4bb] units of the B factor
Argument from authority, from the omniscient Wikipedia: http://en.wikipedia.org/wiki/Radian Although the radian is a unit of measure, it is a dimensionless quantity. The radian is a unit of plane angle, equal to 180/pi (or 360/(2 pi)) degrees, or about 57.2958 degrees, It is the standard unit of angular measurement in all areas of mathematics beyond the elementary level. … the radian is now considered an SI derived unit. On Nov 23, 2009, at 1:31 PM, Ian Tickle wrote: James, I think you misunderstood, no-one is suggesting that we can do without the degree (minute, second, grad, ...), since these conversion units have considerable practical value. Only the radian (and steradian) are technically redundant, and as Marc suggested we would probably be better off without them! Cheers -- Ian -Original Message- From: owner-ccp...@jiscmail.ac.uk [mailto:owner-ccp...@jiscmail.ac.uk] On Behalf Of James Holton Sent: 23 November 2009 16:35 To: CCP4BB@jiscmail.ac.uk Subject: Re: [ccp4bb] units of the B factor Just because something is dimensionless does not mean it is unit-less. The radian and the degree are very good examples of this. Remember, the word unit means one, and it is the quantity of something that we give the value 1.0. Things can only be measured relative to something else, and so without defining for the relevant unit, be it a long-hand description or a convenient abbreviation, a number by itself is not useful. It may have meaning in the metaphysical sense, but its not going to help me solve my structure. A world without units is all well and good for theoreticians who never have to measure anything, but for those of us who do need to know if the angle is 1 degree or 1 radian, units are absolutely required. -James Holton MAD Scientist Artem Evdokimov wrote: The angle value and the associated basic trigonometric functions (sin, cos, tan) are derived from a ratio of two lengths* and therefore are dimensionless. It's trivial but important to mention that there is no absolute requirement of units of any kind whatsoever with respect to angles or to the three basic trigonometric functions. All the commonly used units come from (arbitrary) scaling constants that in turn are derived purely from convenience - specific calculations are conveniently carried out using specific units (be they radians, points, seconds, grads, brads, or papaya seeds) however the units themselves are there only for our convenience (unlike the absolutely required units of mass, length, time etc.). Artem * angle - the ratio of the arc length to radius of the arc necessary to bring the two rays forming the angle together; trig functions - the ratio of the appropriate sides of a right triangle -Original Message- From: CCP4 bulletin board [mailto:ccp...@jiscmail.ac.uk] On Behalf Of Ian Tickle Sent: Sunday, November 22, 2009 10:57 AM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] units of the B factor Back to the original problem: what are the units of B and u_x^2? I haven't been able to work that out. The first wack is to say the B occurs in the term Exp( -B (Sin(theta)/lambda)^2) and we've learned that the unit of Sin(theta)/lamda is 1/Angstrom and the argument of Exp, like Sin, must be radian. This means that the units of B must be A^2 radian. Since B = 8 Pi^2 u_x^2 the units of 8 Pi^2 u_x^2 must also be A^2 radian, but the units of u_x^2 are determined by the units of 8 Pi^2. I can't figure out the units of that without understanding the defining equation, which is in the OPDXr somewhere. 
I suspect there are additional, hidden, units in that definition. The basic definition would start with the deviation of scattering points from the Miller planes and those deviations are probably defined in cycle or radian and later converted to Angstrom so there are conversion factors present from the beginning. I'm sure that if the MS sits down with the OPDXr and follows all these units through he will uncover the units of B, 8 Pi^2, and u_x^2 and the mystery will be solved. If he doesn't do it, I'll have to sit down with the book myself, and that will make my head hurt. Hi Dale A nice entertaining read for a Sunday afternoon, but I think you can only get so far with this argument and then it breaks down, as evidenced by the fact that eventually you got stuck! I think the problem arises in your assertion that the argument of 'exp' must be in units of radians. IMO it can also be in units of radians^2 (or radians^n where n is any unitless number, integer or real, including zero for that matter!) - and this seems to be precisely what happens here. Having a function whose argument can apparently have any one of an infinite number of units is somewhat of an embarrassment! - of course that must mean that the argument actually has no
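For reference, the relation being discussed can be turned into a short worked example, using the convention quoted above (B = 8*pi^2*<u_x^2> and an attenuation factor exp(-B*(sin(theta)/lambda)^2)); the 0.5 A displacement and 2 A spacing below are arbitrary illustrative numbers:

import math

def b_from_u(u_rms):
    """B factor (A^2) from an rms displacement along one axis, u_rms in A: B = 8*pi^2*<u_x^2>.
    The 8*pi^2 is a pure number, so B carries the A^2 of <u_x^2>."""
    return 8.0 * math.pi ** 2 * u_rms ** 2

def debye_waller(B, s):
    """Attenuation factor exp(-B * (sin(theta)/lambda)^2), with s = sin(theta)/lambda in 1/A."""
    return math.exp(-B * s ** 2)

B = b_from_u(0.5)                      # u_rms = 0.5 A  ->  B ~ 19.7 A^2
s = 1.0 / (2 * 2.0)                    # sin(theta)/lambda at Bragg spacing d = 2 A
print(round(B, 1), round(debye_waller(B, s), 2))   # 19.7  0.29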
Re: [ccp4bb] units of the B factor
I agree that the official SI documentation has priority, but as I read it there is no discrepancy between it and Wikipedia. The official SI position (and that of NIST and IUPAC) is that the radian is a dimensionless unit (i.e., a unit of dimension 1). Quoting at length from the SI brochure: 2.2.3 Units for dimensionless quantities, also called quantities of dimension one Certain quantities are defined as the ratio of two quantities of the same kind, and are thus dimensionless, or have a dimension that may be expressed by the number one. The coherent SI unit of all such dimensionless quantities, or quantities of dimension one, is the number one, since the unit must be the ratio of two identical SI units. The values of all such quantities are simply expressed as numbers, and the unit one is not explicitly shown. Examples of such quantities are refractive index, relative permeability, and friction factor. There are also some quantities that are defined as a more complex product of simpler quantities in such a way that the product is dimensionless. Examples include the 'characteristic numbers' like the Reynolds number Re = ρvl/η, where ρ is mass density, η is dynamic viscosity, v is speed, and l is length. For all these cases the unit may be considered as the number one, which is a dimensionless derived unit. Another class of dimensionless quantities are numbers that represent a count, such as a number of molecules, degeneracy (number of energy levels), and partition function in statistical thermodynamics (number of thermally accessible states). All of these counting quantities are also described as being dimensionless, or of dimension one, and are taken to have the SI unit one, although the unit of counting quantities cannot be described as a derived unit expressed in terms of the base units of the SI. For such quantities, the unit one may instead be regarded as a further base unit. In a few cases, however, a special name is given to the unit one, in order to facilitate the identification of the quantity involved. This is the case for the radian and the steradian. The radian and steradian have been identified by the CGPM as special names for the coherent derived unit one, to be used to express values of plane angle and solid angle, respectively, and are therefore included in Table 3. The radian and steradian are special names for the number one that may be used to convey information about the quantity concerned. In practice the symbols rad and sr are used where appropriate, but the symbol for the derived unit one is generally omitted in specifying the values of dimensionless quantities. pp 119-120, The International System of Units (SI). International Bureau of Weights and Measures (BIPM). http://www.bipm.org/utils/common/pdf/si_brochure_8_en.pdf also see http://physics.nist.gov/cuu/Units/units.html http://www.iupac.org/publications/books/gbook/green_book_2ed.pdf On Nov 23, 2009, at 4:03 PM, marc.schi...@epfl.ch wrote: I would believe that the official SI documentation has precedence over Wikipedia. In the SI brochure it is made quite clear that Radian is just another symbol for the number one and that it may or may no be used, as is convenient. Therefore, stating alpha = 15 (without anything else) is perfectly valid for an angle. Marc Quoting Douglas Theobald dtheob...@brandeis.edu: Argument from authority, from the omniscient Wikipedia: http://en.wikipedia.org/wiki/Radian Although the radian is a unit of measure, it is a dimensionless quantity. 
The radian is a unit of plane angle, equal to 180/pi (or 360/(2 pi)) degrees, or about 57.2958 degrees, It is the standard unit of angular measurement in all areas of mathematics beyond the elementary level. … the radian is now considered an SI derived unit. On Nov 23, 2009, at 1:31 PM, Ian Tickle wrote: James, I think you misunderstood, no-one is suggesting that we can do without the degree (minute, second, grad, ...), since these conversion units have considerable practical value. Only the radian (and steradian) are technically redundant, and as Marc suggested we would probably be better off without them! Cheers -- Ian -Original Message- From: owner-ccp...@jiscmail.ac.uk [mailto:owner-ccp...@jiscmail.ac.uk] On Behalf Of James Holton Sent: 23 November 2009 16:35 To: CCP4BB@jiscmail.ac.uk Subject: Re: [ccp4bb] units of the B factor Just because something is dimensionless does not mean it is unit-less. The radian and the degree are very good examples of this. Remember, the word unit means one, and it is the quantity of something that we give the value 1.0. Things can only be measured relative to something else, and so without defining for the relevant unit, be it a long-hand description or a convenient abbreviation, a number by itself is not useful. It may have meaning in the metaphysical sense, but its not going to help me
Re: [ccp4bb] Rmerge - was moelcular replacement with large cell
James, Graeme is right. While <I> does indeed (approximately) follow a Gaussian, |I-<I>| cannot. The absolute value operator keeps it positive (reflects the negative across the origin), and hence it is a half Gaussian. Its mean cannot be zero unless the variance is zero. For standard normals (variance = 1), the mean of |I-<I>| is 0.798, just as Graeme said. You can do the integration. So, the fact that |I-<I>|/<I> is unstable at low I/sigma is *not* a consequence of the peculiar divergent properties of a Cauchy (Lorentzian). Rather, it's a consequence of E(<I>) being zero. And, like your calculator knows, division by zero is undefined (or infinite, depending on your proclivities). Cheers, Douglas On Jul 15, 2009, at 5:03 PM, James Holton wrote: I tried plugging I/sigma = 0 into your formula below, but my calculator returned -James Holton MAD Scientist Graeme Winter wrote: James, I'm not sure you're completely right here - it's reasonably straightforward to show that Rmerge ~ 0.7979 / (I/sigma) (Weiss & Hilgenfeld, J. Appl. Cryst. 1997) which can be verified from e.g. the Scala log file, provided that the *unmerged* I/sigma is considered: http://www.ccp4.ac.uk/xia/rmerge.jpg This example did not exhibit much radiation damage so it does represent a best case. For (unmerged) I/sigma < 1 the statistics do tend to become unreliable, which I found was best demonstrated by inspection of the <E^4> plot - up to I/sigma ~ 1 it was ~ 2, but increased substantially thereafter. This I had assumed represented the fact that the intensities were drawn from a Gaussian distribution with low I/sigma rather than the exponential (Wilson) distribution which would be expected for intensities. By repeatedly selecting small random subsets* of unique reflections in the example data set and merging them separately, I found that the error on the Rmerge above for the weakest reflections was about 0.05. Since this retains the same multiplicity and the mean value converges on the complete data set statistics, I believe that the comparisons are valid. I guess I don't believe you :o) Best, Graeme * CCTBX is awesome for this kind of thing! 2009/7/15 James Holton jmhol...@lbl.gov: Actually, if I/sd < 3, Rmerge, Rpim, Rrim, etc. are all infinity. Doesn't matter what your redundancy is. Don't believe me? Try it. The extreme case is I/sd = 0, and as long as there is some background (and, let's face it, there always is), the observed spot intensity will be equally likely to be positive or negative, with a (basically) Gaussian distribution. So, if you generate say, ten Gaussian-random numbers (centered on zero), take their average value <I>, compute the average deviation from that average |I-<I>|, and then divide |I-<I>|/<I>, you will get the Rmerge expected for I/sd = 0 at a redundancy of 10. Problem is, if you do this again with a different random number seed, you will get a very different Rmerge. Even if you do it with a million different random number seeds and compute the average Rmerge, you will always get wildly different values. Some positive, some negative. And it doesn't matter how many data points you use to compute the Rmerge: averaging a million Rmerge values will give a different answer than averaging a million and one. The reason for this numerical instability is because both <I> and |I-<I>| follow a Gaussian distribution that is centered at zero, and the ratio of two numbers like this has a Lorentzian distribution. The Lorentzian looks a lot like a Gaussian, but has much fatter tails.
Fat enough so that the Lorentzian distribution has NO MEAN VALUE. Seriously. It is hard to believe that the average value of something that is equally likely to be positive or negative could be anything but zero, but for all practical purposes you can never arrive at the average value of something with a Lorentzian distribution. At least not by taking finite samples. So, no matter what the redundancy, you will always get a different Rmerge. However, if <I> is not centered on zero (I/sd > 0), then the ratio of the two Gaussian-random numbers starts to look like a Gaussian itself, and this distribution does have a mean value (Rmerge will be reproducible). However, this does not happen all at once. The tails start to shrink at I/sd = 1, they are even smaller at I/sd = 2, and the distribution finally loses all Lorentzian character at I/sd = 3. Only then is Rmerge a meaningful quantity. So, perhaps our forefathers who first instituted the practice of a 3-sigma cutoff for all intensities actually DID know what they were doing! All R-statistics (including Rcryst and Rfree) are unstable in this way for weak data, but sometime in the early 1990s the practice of computing R-factors on all data crept into the field. I'm not saying we should not use all data - maximum likelihood refinement uses sigmas properly, and weak data are …
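[Editor's note, not part of the original thread: a minimal numerical sketch, assuming numpy, of both points above - the 0.798 constant behind the Weiss & Hilgenfeld relation Rmerge ~ 0.8/(I/sigma), and the instability of Rmerge once <I> is centered on zero. The multiplicity of 10 and the reflection count are arbitrary illustration choices.]

import numpy as np

rng = np.random.default_rng(0)

# (1) The mean absolute deviation of a standard normal is sqrt(2/pi) ~ 0.7979,
#     the constant in Rmerge ~ 0.7979 / (I/sigma).
x = rng.standard_normal(1_000_000)
print(abs(x).mean(), np.sqrt(2 / np.pi))          # both roughly 0.798

def rmerge(i_over_sigma, n_refl=10_000, multiplicity=10):
    # Sum_h Sum_l |I_hl - <I>_h| / Sum_h Sum_l I_hl, with each observation
    # drawn from a Gaussian of mean i_over_sigma and sigma = 1.
    obs = i_over_sigma + rng.standard_normal((n_refl, multiplicity))
    mean_i = obs.mean(axis=1, keepdims=True)
    return np.abs(obs - mean_i).sum() / obs.sum()

# (2) Strong data: reproducible and close to 0.8 / (I/sigma).
print([round(rmerge(10), 3) for _ in range(3)])   # roughly 0.076 each time

# (3) I/sd = 0: the denominator has expectation zero, so repeated runs give
#     wildly different (and sometimes negative) "Rmerge" values.
print([round(rmerge(0), 1) for _ in range(5)])

Shrinking n_refl in this sketch mimics Graeme's random-subset experiment: with a reasonable mean intensity the subset Rmerge values converge, whereas at I/sd = 0 they never settle down.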
Re: [ccp4bb] 3D modeling program
- Dima Klenchin [EMAIL PROTECTED] wrote: But how do we establish phylogeny? - Based on simple similarity! (Structural/morphological in early days and largely on sequence identity today). It's clearly circular logic: Hardly. Two sequences can be similar and non-homologous at all levels. Also, two similar proteins can be homologous at one level but not at another. It's also possible for two proteins that have no detectable similarity above random sequences to be homologous. Hence there is no circularity. Of course there is. Just how do you establish that the two are not homologous? - By finding that they don't belong to the same branch. And how do you decide what constitutes the same branch? - By looking at how similar things are! But you have not established that there is circularity. Logical circularity means that you assume (as an essential premise) one of your conclusions. What exactly is the argument you are criticizing, and what is the conclusion that is assumed? When we conclude that two proteins are homologous at some level, we have not assumed that they are homologous at that level. Rather, the conclusion of homology is an inference that uses similarity as relevant evidence. Plus, presumably all living things trace their ancestry to the primordial soup - so the presence or lack of ancestry is just a matter of how deeply one is willing to look. This is also wrong. Even if all organisms trace back to one common ancestor, that does not mean all proteins are homologous. New protein-coding genes can and do arise independently, and hence they are not homologous to any other existing proteins. Just how do they arise independently? Would that be independent of DNA sequence? And if not, then why can't shared ancestry of the DNA sequence fully qualify for homology? Perhaps it could (although in some cases not), but still the new protein would not be homologous to any other protein *at the protein level*. You also ignore the levels-of-homology concept -- just because two proteins are homologous at one level does not mean they are homologous at others. For example, consider these three TIM barrel proteins: human IMPDH, hamster IMPDH, and chicken triose phosphate isomerase. They are all three homologous as TIM barrels. However, they are not all homologous as dehydrogenases -- only the human and hamster proteins are homologous as dehydrogenases. ... And all that is concluded based on sequence similarities [of other proteins/DNAs] used to construct the phylogenetic tree. So, ultimately, homology ~ similarity. This is a non sequitur. Yes, homology inference uses similarity as evidence, but that does not mean homology is equivalent to similarity. Two facile counterexamples to your claim: two proteins can be very similar yet non-homologous, and two very dissimilar proteins can be homologous. Homology is thus not equivalent to similarity. QED. The generic concept of homology used to be used as a proof of evolution. Today, things seem to be reversed and evolution is being used to infer homology. A useful concept turned into a statement with little or no utility. In fact quite the opposite is true. Before evolutionary theory, homology was a vacuous, mysterious concept with no utility. It was simply the descriptive observation that similar structures could have different functions. Now we know why that is the case.
You have already pointed out that we have redefined homology (evolutionary homology is not the same as generic, pre-evolutionary homology), and this fact proves that the logic is non-circular: we assume generic homology and conclude evolutionary homology. This could only be circular if the two concepts were identical, which you admit they are not. Your argument founders on an equivocation. Cheers, Douglas
Re: [ccp4bb] 3D modeling program
- Dima Klenchin [EMAIL PROTECTED] wrote: But how do we establish phylogeny? - Based on simple similarity! This is a common misconception. Modern phylogenetic methods (Bayesian, maximum likelihood, and some distance-based) rely on explicit models of molecular evolution, and the *patterns* of similarity they create. Even maximum parsimony, which is not model-based, does not reconstruct phylogenies based on simple similarity. Ah! The old rhetorical trick of changing the problem or question a posteriori! All I pointed out was that things can't be "25% homologous". Well, you were right that in today's definition things can't be. But you seem to be missing my point that today's definition is essentially meaningless (relies on circular logic and has no epistemological value) and that nothing would be lost if the term reverted to its generic usage, "similar". There would still be a question to be asked - "similar for what reason?" - the same question that is presumed to be answered whenever one invokes phylogeny-based homology. How does this make any sense? Two proteins can have certain similarities in sequence (or structure) due to either convergence or homology. That is the answer to your question of "similar for what reason", and hence you have just shown that similarity is not the same as homology, and that homology is not meaningless. I'm glad your opinion is humble here, because it has much to be humble about :-) Do you really think that property (e.g., structure and function) prediction is not useful? And I can't even begin to understand how you can think that 'homology' in its present-day meaning is a pre-Darwinian concept. Homology is a pre-Darwinian concept that was *redefined* post-Darwin. That's what I wrote. Okay, so can we all agree now that we won't be saying and writing things like "the two proteins are X% homologous" from now on? IMHO, it truly does not matter if we do or do not as long as we understand each other. You are hard to understand if you say that two proteins are 25% homologous. Do you mean that one domain, out of four, is homologous between the proteins? That is the only sense in which that could be construed as correct. Like I wrote in the original reply, paying too much attention to definitions of fuzzy abstract concepts is not worth it. The homology concept is often misunderstood, that is true. But there are still blatantly incorrect uses, and substituting "25% homologous" for "25% similar" is unequivocally wrong. An important point to note is that homology must be qualified. There are levels of homology, and a structure can be homologous at one level but not at another. The classic example is bird and bat wings. They are homologous as vertebrate forelimbs, but not as wings.
Re: [ccp4bb] 3D modeling program
- Anastassis Perrakis [EMAIL PROTECTED] wrote: I think we are getting a bit too philosophical on a matter which is mainly terminology. 1. To quantify how similar two proteins are, one should best refer to 'percent identity'. That's clear, correct, and unambiguous. 2. One can also refer to similarity. In that case it should be clarified what is considered to be similar, namely which comparison matrix was used to quantify the similarity. 3. Homology means common evolutionary origin. One understanding is that homology refers to the genome of 'LUCA', the hypothetical last universal common ancestor. I am not an evolutionary biologist, but I would clearly disagree that homology is a leftover pre-Darwinian term. The very notion of homology is only meaningful in the context of evolution. Thus, to me: 1. "These proteins are 56% identical" is clear. Even this is unclear without qualification. Identity is always determined by alignment, and you can get different %ID by using different matrices. 2. "These proteins are 62% similar" is unclear. 3. "These proteins are 62% similar using the Dayhoff-50 matrix" is OK. 4. "These proteins are homologous" is clear, but can be subjective as to what homology is. 5. "These proteins are 32% homologous" is simply wrong. Sorry for the non-crystallographic late evening blabber. A. On 6 Dec 2008, at 21:09, Dima Klenchin wrote: Having a generic dictionary definition is nice and dandy. However, in the present context, the term 'homology' has a much more specific meaning: it pertains to the having (or not) of a common ancestor. Thus, it is a binary concept. (*) But how do we establish phylogeny? - Based on simple similarity! (Structural/morphological in early days and largely on sequence identity today). It's clearly circular logic: Let's not use the generic definition; instead, let's use a specialized definition; and let's not notice that the specialized definition wholly depends on a system that is built using the generic definition to begin with. Plus, presumably all living things trace their ancestry to the primordial soup - so the presence or lack of ancestry is just a matter of how deeply one is willing to look. In other words, it's nice and dandy to have a theoretical binary concept but in practice it is just as fuzzy as anything else. IMHO, the phylogenetic concept of homology in biology does not buy you much of anything useful. It seems to be just a leftover from pre-Darwinian days - redefined since but still lacking solid foundation. Dima
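[Editor's note, not from the thread: a small sketch of why even "percent identity" needs qualification, assuming a reasonably recent Biopython is available. The reported %ID depends on the alignment, which in turn depends on the substitution matrix and gap penalties chosen; the two toy sequences here are arbitrary.]

from Bio import Align
from Bio.Align import substitution_matrices

seq1, seq2 = "HEAGAWGHEE", "PAWHEAE"   # arbitrary toy sequences

def percent_identity(matrix_name, open_gap=-10, extend_gap=-0.5):
    aligner = Align.PairwiseAligner()
    aligner.substitution_matrix = substitution_matrices.load(matrix_name)
    aligner.open_gap_score = open_gap
    aligner.extend_gap_score = extend_gap
    aln = aligner.align(seq1, seq2)[0]
    blocks1, blocks2 = aln.aligned          # aligned (gap-free) segment pairs
    matches = columns = 0
    for (s1, e1), (s2, e2) in zip(blocks1, blocks2):
        for a, b in zip(seq1[s1:e1], seq2[s2:e2]):
            columns += 1
            matches += (a == b)
    # Even the denominator is a choice: aligned columns here, but the shorter
    # sequence length or the full alignment length are also in common use.
    return 100.0 * matches / columns

# Different matrices (or gap penalties) can give different alignments, and
# therefore different "% identity" for the same pair of sequences.
for name in ("BLOSUM62", "PAM250"):
    print(name, round(percent_identity(name), 1))

Which is point 1 above in code: "56% identical" is only unambiguous once the alignment protocol is stated.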
Re: [ccp4bb] 3D modeling program
- Dima Klenchin [EMAIL PROTECTED] wrote: Having a generic dictionary definition is nice and dandy. However, in the present context, the term 'homology' has a much more specific meaning: it pertains to the having (or not) of a common ancestor. Thus, it is a binary concept. (*) But how do we establish phylogeny? - Based on simple similarity! (Structural/morphological in early days and largely on sequence identity today). It's clearly circular logic: Hardly. Two sequences can be similar and non-homologous at all levels. Also, two similar proteins can be homologous at one level but not at another. It's also possible for two proteins that have no detectable similarity above random sequences to be homologous. Hence there is no circularity. Let's not use the generic definition; instead, let's use a specialized definition; and let's not notice that the specialized definition wholly depends on a system that is built using the generic definition to begin with. Plus, presumably all living things trace their ancestry to the primordial soup - so the presence or lack of ancestry is just a matter of how deeply one is willing to look. This is also wrong. Even if all organisms trace back to one common ancestor, that does not mean all proteins are homologous. New protein-coding genes can and do arise independently, and hence they are not homologous to any other existing proteins. You also ignore the levels-of-homology concept -- just because two proteins are homologous at one level does not mean they are homologous at others. For example, consider these three TIM barrel proteins: human IMPDH, hamster IMPDH, and chicken triose phosphate isomerase. They are all three homologous as TIM barrels. However, they are not all homologous as dehydrogenases -- only the human and hamster proteins are homologous as dehydrogenases. In other words, it's nice and dandy to have a theoretical binary concept but in practice it is just as fuzzy as anything else. IMHO, the phylogenetic concept of homology in biology does not buy you much of anything useful. It seems to be just a leftover from pre-Darwinian days - redefined since but still lacking solid foundation. Dima