On general scientific principles the reasons for archiving "raw data"
all boil down to one thing: there was a systematic error, and you hope
to one day account for it. After all, a "systematic error" is just
something you haven't modelled yet. Is it worth modelling? That depends...
There are two main kinds of systematic error in MX:
1) Fobs vs Fcalc
Given that the reproducibility of Fobs is typically < 3%, but
typical R/Rfree values are in the 20%s, it is safe to say that this is a
rather whopping systematic error. What causes it? Dunno. Would
structural biologists benefit from being able to model it? Oh yes!
Imagine being able to reliably see a ligand that has an occupancy of
only 0.05, or to be able to unambiguously distinguish between two
proposed reaction mechanisms and back up your claims with hard-core
statistics (derived from SIGF). Perhaps even teasing apart all the
different minor conformers occupied by the molecule in its functional
cycle? I think this is the main reason why we all decided to archive
Fobs: 20% error is a lot.
2) scale factors
We throw a lot of things into "scale factors", including sample
absorption, shutter timing errors, radiation damage, flicker in the
incident beam, vibrating crystals, phosphor thickness, point-spread
variations, and many other phenomena. Do we understand the physics
behind them? Yes (mostly). Is there "new biology" to be had by
modelling them more accurately? No. Unless, of course, you count all
the structures we have not solved yet.
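For reference, the ~20% discrepancy in point 1 is just the conventional crystallographic R-factor. A minimal sketch, using toy amplitudes invented for illustration (only the formula R = sum|Fobs - k*Fcalc| / sum Fobs and the least-squares scale k are standard):

```python
# Conventional crystallographic R-factor with a simple linear scale factor k.
# The numbers fed to it below are toy values, not from any real dataset.

def r_factor(fobs, fcalc):
    """R = sum |Fobs - k*Fcalc| / sum Fobs, with least-squares scale k."""
    k = sum(fo * fc for fo, fc in zip(fobs, fcalc)) / sum(fc * fc for fc in fcalc)
    numerator = sum(abs(fo - k * fc) for fo, fc in zip(fobs, fcalc))
    return numerator / sum(fobs)
```

With Fobs and Fcalc agreeing to within measurement error this would sit near the ~3% reproducibility floor; real refinements sit closer to 0.20.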
Wouldn't it be nice if phasing from sulfur, phosphorus, chloride and
other "native" elements actually worked? You wouldn't have to grow
SeMet protein anymore, and you could go after systems that don't express
well in E. coli. Perhaps even going to the native source! I think
there is plenty of "new biology" to be had there. Wouldn't it be nice
if you could do S-SAD even though your spots were all smeary and
overlapped and mosaic and radiation damaged?
Why don't we do this now? Simple: it doesn't work. Why doesn't it
work? Because we don't know all the "scale factors" accurately enough.
In most cases, the "% error" from all the scale factors adds up
to ~3% (aka Rmerge, Rpim etc.), but the change in spot intensities due
to native element anomalous scattering is usually less than 1%.
Currently, the world record for smallest Bijvoet ratio is ~0.5% (Wang et
al. 2006), but if photon-counting were the only source of error, we
should be able to get Rmerge of ~0.1% or less, particularly in the
low-angle resolution bins. If we can do that, then there will be little
need for SeMet anymore.
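A back-of-envelope way to see where the ~0.1% figure comes from, assuming pure Poisson counting statistics (the sqrt(2/pi) factor converts an r.m.s. fractional error into the mean absolute deviation that Rmerge roughly measures; multiplicity and all the "scale factor" errors above are deliberately ignored):

```python
import math

# Poisson-limited Rmerge: with N photons in a spot, sigma(I)/I = 1/sqrt(N).
# The mean absolute deviation of a Gaussian is sqrt(2/pi)*sigma, so a
# photon-counting-limited Rmerge is roughly sqrt(2/pi)/sqrt(N).
# Crude estimate only: multiplicity and instrumental errors are ignored.

def poisson_rmerge(n_photons):
    """Approximate Rmerge if photon counting were the only error source."""
    return math.sqrt(2.0 / math.pi) / math.sqrt(n_photons)

def photons_for_rmerge(target):
    """Photons per reflection needed to reach a target Rmerge."""
    return (math.sqrt(2.0 / math.pi) / target) ** 2
```

By this estimate, reaching Rmerge of ~0.1% takes on the order of 6x10^5 recorded photons per reflection, which is not unreasonable for strong low-angle spots.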
But we need the "raw" images if we are to have any hope of figuring out
how to get the errors down to the 0.1% level. There is no one magic
dataset that will tell us how to do this; we need to "average over" lots
of them. Yes, this is further "upstream" of the "new biology" than
deposited Fs, and yes the cost of archiving images is higher, but I
think the potential benefit to the structural biology community, if we
can crack the 0.1% S-SAD barrier, would be nothing short of revolutionary.
-James Holton
MAD Scientist
On 11/1/2011 8:32 AM, Anastassis Perrakis wrote:
Dear Gerard
Isolating your main points:
but there would have been no PDB-REDO because the data for running it
would simply not have been available! ;-) . Or do you think the
parallel does not apply?
...
have thought, some value. From the perspective of your message, then,
why are the benefits of PDB-REDO so unique that PDB-REPROCESS would
have no chance of measuring up to them?
I was thinking of the inconsistency while sending my previous email
... ;-)
Basically, the parallel does apply. PDB-REPROCESS in a few years would
be really fantastic - speaking as a crystallographer and methods
developer.
Speaking as a structural biologist, though, I did think long and hard
about the usefulness of PDB_REDO. I obviously decided it's useful, since
I am now heavily involved in it, for a few reasons: uniformity of final
model treatment, improving refinement software, better statistics on
structure quality metrics, and of course seeing if the new models will
change our understanding of the biology of the system.
An experiment that I would like to do as a structural biologist - is
the following:
What about adding an "increasing noise" model to the Fobs's of a few
datasets and re-refining? How much would that noise change the final
model and its quality metrics, in absolute terms?
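One way the noise-inflation step of that experiment could be sketched (the fobs/sigf arrays and the noise scale are hypothetical; the re-refinement itself would be done with an external refinement program, not shown here):

```python
import random

# Sketch of the proposed experiment: inflate the noise on Fobs and then
# watch how refinement statistics respond after re-refining externally.
# Noise is Gaussian with standard deviation scale*SIGF for each reflection.

def add_noise(fobs, sigf, scale, seed=0):
    """Return Fobs with extra Gaussian noise of s.d. scale*SIGF added."""
    rng = random.Random(seed)  # fixed seed for a reproducible "dataset"
    return [max(0.0, fo + rng.gauss(0.0, scale * sf))
            for fo, sf in zip(fobs, sigf)]
```

Running this for scale = 0.0, 0.5, 1.0, 2.0, ... and re-refining each perturbed dataset would trace out how R/Rfree and geometry metrics degrade with measurement error.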
(for the changes that PDB_RE(BUILD) makes, have a preview at
http://www.ncbi.nlm.nih.gov/pubmed/22034521
... I tried to avoid the shamelessly self-promoting plug, but couldn't
resist in the end!)
That experiment - or a better-designed variant of it - might tell us
whether we should be advocating the archiving of all images. And once
scientifically convinced of its importance beyond methods development,
we could all argue a strong case to the funding and hosting agencies.
Tassos
PS Of course, that does not negate the all-important argument that,
when struggling with marginal data, better processing software is
essential. There is a clear need for better software to process images,
especially for low-resolution and low signal/noise cases.
Since that depends on having test data, I am all for supporting an
initiative to collect such data, and I would gladly spend a day digging
through our archives to contribute.