On general scientific principles, the reasons for archiving "raw data" all boil down to one thing: there was a systematic error, and you hope to one day account for it. After all, a "systematic error" is just something you haven't modeled yet. Is it worth modeling? That depends...

There are two main kinds of systematic error in MX:
1) Fobs vs Fcalc
Given that the reproducibility of Fobs is typically < 3%, but typical R/Rfree values are in the 20% range, it is safe to say that this is a rather whopping systematic error. What causes it? Dunno. Would structural biologists benefit from being able to model it? Oh yes! Imagine being able to reliably see a ligand that has an occupancy of only 0.05, or to unambiguously distinguish between two proposed reaction mechanisms and back up your claims with hard-core statistics (derived from SIGF). Perhaps even teasing apart all the different minor conformers the molecule occupies in its functional cycle? I think this is the main reason why we all decided to archive Fobs: 20% error is a lot (see the toy numerical sketch after point 2 below).

2) scale factors
We throw a lot of things into "scale factors", including sample absorption, shutter timing errors, radiation damage, flicker in the incident beam, vibrating crystals, phosphor thickness, point-spread variations, and many other phenomena. Do we understand the physics behind them? Yes (mostly). Is there "new biology" to be had by modeling them more accurately? No. Unless, of course, you count all the structures we have not solved yet.
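To put rough numbers on point 1: here is a toy Python sketch (synthetic amplitudes, not anyone's real data) comparing the agreement between two repeated "measurements" of Fobs with the agreement between Fobs and a deliberately mis-modeled Fcalc. The 3% and 25% noise levels are arbitrary knobs chosen only to reproduce the percentages quoted above; the point is simply that a ~20% R factor sits far above anything that ~3% random measurement error can explain.

```python
# Toy illustration (not real data): why a ~20% R factor cannot come from
# ~3% measurement error alone.
import numpy as np

rng = np.random.default_rng(0)
n = 10000
F_true = rng.gamma(shape=2.0, scale=100.0, size=n)   # made-up "true" amplitudes

# two independent measurements of each reflection, ~3% random error each
F_obs1 = F_true * (1 + 0.03 * rng.standard_normal(n))
F_obs2 = F_true * (1 + 0.03 * rng.standard_normal(n))

# a model that disagrees with the truth by ~25% (the unmodeled systematic part)
F_calc = F_true * (1 + 0.25 * rng.standard_normal(n))

def r_factor(a, b):
    # classic R = sum|a - b| / sum(a)
    return np.sum(np.abs(a - b)) / np.sum(a)

print("reproducibility of Fobs (repeat vs repeat): %.1f%%" % (100 * r_factor(F_obs1, F_obs2)))
print("Fobs vs Fcalc (crystallographic R):         %.1f%%" % (100 * r_factor(F_obs1, F_calc)))
```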

Wouldn't it be nice if phasing from sulfur, phosphorus, chloride and other "native" elements actually worked? You wouldn't have to grow SeMet protein anymore, and you could go after systems that don't express well in E. coli. Perhaps even go to the native source! I think there is plenty of "new biology" to be had there. Wouldn't it be nice if you could do S-SAD even though your spots were all smeary and overlapped and mosaic and radiation-damaged?

Why don't we do this now? Simple: it doesn't work. Why doesn't it work? Because we don't know all the "scale factors" accurately enough. In most cases, the "% error" from all the scale factors adds up to ~3% (aka Rmerge, Rpim, etc.), but the change in spot intensities due to native-element anomalous scattering is usually less than 1%. Currently, the world record for smallest Bijvoet ratio is ~0.5% (Wang et al. 2006), but if photon counting were the only source of error, we should be able to get Rmerge of ~0.1% or less, particularly in the low-angle resolution bins. If we can do that, then there will be little need for SeMet anymore.
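As a back-of-the-envelope check on those numbers, here is a small sketch of what Rmerge would look like if photon counting were the only source of error. The assumptions are mine: pure Poisson noise on the raw counts, a Gaussian approximation, a multiplicity of 4, and the standard unweighted Rmerge definition.

```python
# Back-of-the-envelope: expected Rmerge if Poisson counting noise were the
# ONLY source of error (no absorption, shutter, beam flicker, etc.).
import numpy as np

def counting_limited_rmerge(counts_per_obs, multiplicity=4):
    """Approximate Rmerge = sum|I_i - <I>| / sum(I_i) from counting statistics alone."""
    m = multiplicity
    sigma = np.sqrt(counts_per_obs)                    # Poisson: sigma(I) = sqrt(counts)
    # Gaussian approximation: E|I_i - <I>| = sigma * sqrt(2/pi) * sqrt((m-1)/m)
    return np.sqrt(2.0 / np.pi) * np.sqrt((m - 1.0) / m) * sigma / counts_per_obs

for counts in (1e3, 1e4, 1e5, 1e6):
    print("%8.0f photons/spot -> Rmerge ~ %.2f%%" % (counts, 100 * counting_limited_rmerge(counts)))
```

Strong low-angle spots can carry 1e5-1e6 counts, where this counting limit drops well below 0.1%; the fact that measured Rmerge values sit near ~3% anyway is exactly the "scale factor" systematic error described above.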

But we need the "raw" images if we are to have any hope of figuring out how to get the errors down to the 0.1% level. There is no one magic dataset that will tell us how to do this; we need to "average over" lots of them. Yes, this is further "upstream" of the "new biology" than deposited Fs, and yes, the cost of archiving images is higher, but I think the potential benefit to the structural biology community if we can crack the 0.1% S-SAD barrier is nothing short of revolutionary.

-James Holton
MAD Scientist

On 11/1/2011 8:32 AM, Anastassis Perrakis wrote:
Dear Gerard

Isolating your main points:

but there would have been no PDB-REDO because the
data for running it would simply not have been available! ;-) . Or do you
think the parallel does not apply?
...
have thought, some value. From the perspective of your message, then, why
are the benefits of PDB-REDO so unique that PDB-REPROCESS would have no
chance of measuring up to them?

I was thinking of the inconsistency while sending my previous email ... ;-)

Basically, the parallel does apply. PDB-REPROCESS in a few years would
be really fantastic - speaking as a crystallographer and methods developer.

Speaking as a structural biologist though, I did think long and hard about
the usefulness of PDB_REDO. I obviously decided it is useful, since I am now
heavily involved in it, for a few reasons: uniformity of final model treatment, improving refinement software, better statistics on structure quality metrics,
and of course seeing if the new models will change our understanding of
the biology of the system.

An experiment that I would like to do as a structural biologist is the following: what about adding an "increasing noise" model to the Fobs values of a few datasets and re-refining? How much would that noise change the final model quality metrics, and the model itself in absolute terms?
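A minimal sketch of that noise-injection step might look like the following (pure numpy, with placeholder numbers standing in for an MTZ's F/SIGF columns; the reflection-file I/O, e.g. via gemmi or cctbx, and the re-refinement itself in REFMAC or phenix are left out).

```python
# Sketch of the noise-injection step: inflate the reported measurement error
# on Fobs by a chosen factor.  Placeholder numbers; MTZ I/O and the actual
# re-refinement are not shown.
import numpy as np

def add_noise(f_obs, sig_f, noise_scale, seed=0):
    """Return Fobs perturbed by Gaussian noise with sd = noise_scale * SIGF."""
    rng = np.random.default_rng(seed)
    f_noisy = f_obs + noise_scale * sig_f * rng.standard_normal(f_obs.shape)
    return np.clip(f_noisy, 0.0, None)    # amplitudes must stay non-negative

# placeholder amplitudes and sigmas
f_obs = np.array([1523.0, 88.1, 412.7, 960.3])
sig_f = np.array([  12.1,  5.3,   8.9,  10.2])
for scale in (1, 2, 4, 8):
    print("noise x%d:" % scale, add_noise(f_obs, sig_f, noise_scale=scale))
```

Re-refining against each perturbed dataset and plotting R/Rfree and the geometry metrics against the noise scale would give the curve the experiment is asking about.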

(For a preview of the changes that PDB_RE(BUILD) makes, see http://www.ncbi.nlm.nih.gov/pubmed/22034521 ... I tried to avoid the shamelessly self-promoting plug, but could not resist in the end!)

That experiment - or a better-designed variant of it - would maybe tell us whether we should be advocating the archiving of all images. Being scientifically convinced of its importance beyond methods development, we could all argue a strong case
to the funding and hosting agencies.

Tassos

PS: Of course, that does not negate the all-important argument that, when struggling with marginal data, better processing software is essential. There is a clear need for better software to process images, especially for low-resolution and low signal-to-noise cases. Since that depends on having test data, I am all for supporting an initiative to collect such data,
and I would gladly spend a day digging through our archives to contribute.
