Re: [ccp4bb] should the final model be refined against full datset

2011-10-18 Thread Ed Pozharski
 Selecting a test set that minimizes Rfree is so wrong on so many levels.
 Unless, of course, the only thing I know about Rfree is that it is the
 magic number that I need to make small by all means necessary.

By using a simple genetic algorithm, I managed to get Rfree for a
well-refined model as low as 14.6% and as high as 19.1%.  The dataset is
not too small (~40,000 reflections in all, with a standard-sized 5% test
set).  So you can get a spread as wide as 4.5% even with a not-so-small
dataset.  Only ~1/3 of the test reflections are exchanged to achieve this.

What's curious is that, contrary to my expectations, the test set
remains well distributed throughout resolution shells upon this awful
optimization and the F/sigF for the working set and test set remain
close.  Not sure how to judge which model is actually better, but it's
noteworthy that the FOM gets worse for *both* upward and downward
optimization of the test set.
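In sketch form, the search amounts to a simple mutate-and-select loop. Everything below is illustrative, not the actual script: the `toy_rfree` score is a stand-in for a real refinement run, and the sizes and seeds are arbitrary.

```python
import random

def toy_rfree(test_set, noise):
    # Stand-in for a full refinement run: here the "Rfree" of a candidate
    # test set is just the mean noise level of its reflections.  In the
    # real experiment, every evaluation means re-refining the model
    # against the complementary working set.
    return sum(noise[i] for i in test_set) / len(test_set)

def optimize_test_set(n_refl, n_test, noise, steps=500, seed=0):
    rng = random.Random(seed)
    best = rng.sample(range(n_refl), n_test)
    init_score = best_score = toy_rfree(best, noise)
    for _ in range(steps):
        # "mutate": swap one test reflection for a working-set reflection
        swap_in = rng.randrange(n_refl)
        if swap_in in best:
            continue
        cand = list(best)
        cand[rng.randrange(n_test)] = swap_in
        score = toy_rfree(cand, noise)
        if score < best_score:          # keep the fitter test set
            best, best_score = cand, score
    return init_score, best_score

rng = random.Random(1)
noise = [rng.random() for _ in range(2000)]      # per-reflection noise
init_score, opt_score = optimize_test_set(2000, 100, noise)
```

Flipping the comparison maximizes instead of minimizes, giving the upper end of the spread; as in the experiment above, only a fraction of the test reflections need to be exchanged to move the score substantially.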


-- 
After much deep and profound brain things inside my head, 
I have decided to thank you for bringing peace to our home.
Julian, King of Lemurs


Re: [ccp4bb] should the final model be refined against full datset

2011-10-17 Thread Tim Gruene

Dear Nicholas,

for a data set with 5132 unique reflections you should flag 10.5% for
Rfree; otherwise you might as well drop Rfree completely and use the
whole data set for refinement. At least this is how I understand Axel
Brunger's article about Rfree, where he states that one needs 500-1000
reflections for Rfree to be statistically meaningful.
I have wondered where the '5% rule' came from; it compromises Rfree
for low-resolution data sets (especially with high symmetry).

If Axel Brunger's initial statement has become obsolete I would
appreciate some clarification on the required number of flagged
reflections, but until then I will keep on flagging 500-1000 reflections
rather than 5%.
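This rule of thumb (flag 500-1000 reflections rather than a flat 5%) can be written down as a small helper. The function name and the exact floor/ceiling values are illustrative choices, not a CCP4 convention:

```python
def free_flag_count(n_total, fraction=0.05, floor=500, ceiling=1000):
    # Flag ~5% of the reflections, but never fewer than ~500 (so that
    # Rfree is statistically meaningful) nor more than ~1000 (so that
    # large data sets do not sacrifice refinement power).
    return min(n_total, max(floor, min(ceiling, round(fraction * n_total))))

# For the 5132-reflection data set discussed here this gives 500 flags
# (roughly 10% of the data); for a 40,000-reflection data set the flat
# 5% (2000 reflections) is capped at 1000 (2.5%).
```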

Tim

On 10/15/2011 10:48 AM, Nicholas M Glykos wrote:
 For structures with a small number of reflections, the statistical 
 noise in the 5% sets can be very significant indeed. We have seen 
 differences between Rfree values obtained from different sets reaching 
 up to 4%.

 This is very intriguing indeed! Is there something specific in these 
 structures that makes the Rfree differences between sets reach 4%? 
 NCS? Or the 5% set having less than ~1000-1500 reflections?
 
 Tassos, by your standards, these structures should have been described as 
 'tiny' and not small ... ;-)   [Yes, significantly less than 1000. In one 
 case the _total_ number of reflections was 5132 (which were, 
 nevertheless, slowly and meticulously measured by a CAD4 one-by-one. These 
 were the days ... :-)) ].
 
 
 
 

--
Dr Tim Gruene
Institut fuer anorganische Chemie
Tammannstr. 4
D-37077 Goettingen

GPG Key ID = A46BEE1A



Re: [ccp4bb] should the final model be refined against full datset

2011-10-17 Thread John R Helliwell
Dear Gerard,Tom and Bernhard,
Thank you for highlighting the IUCr Diffraction Data Deposition Working
Group and Forum.


Dear Colleagues,
I am travelling at present and apologise for not replying sooner to
the CCP4bb; I also have intermittent email access until later
this week, when I return to the office.

The points being raised in this CCP4bb thread are very important and
the IUCr also recognises this.

The role of the IUCr Working Group that has been set up is to bring
information into focus and to identify steps forward. We seek to make
progress towards archiving and making available all relevant
scientific data associated with a publication (or a completed
structure deposition in a validated database such as the PDB). The
consultation process is being formalised via the IUCr Forum pages. The
Working Group, and a wider group consisting of IUCr Commissions and
consultants, has been established for discussion and planning. We are
also aiming at a community consultation via the Forum approach and
will launch the Forum for this as soon as possible.

The IUCr invites the widest possible input, from the various
communities that the IUCr Commissions serve, on the future of
diffraction data deposition, which can surely be improved. This Forum
will help to record an organised set of inputs for future reference.

The Forum is being set up and will require registration, which is a
straightforward process. Details will follow shortly.

Members of the Working Group and its consulted representatives are listed below.

Best wishes and regards,
Yours sincerely,
John
Prof John R Helliwell DSc
Chairman of the IUCr Diffraction Data Deposition Working Group (IUCr DDD WG).



IUCr DDD WG Members
Steve Androulakis (TARDIS representative)
John R. Helliwell (Chair) (IUCr ICSTI Representative; Chairman of the
IUCr Journals Commission 1996-2005)
Loes Kroon-Batenburg (Data processing software)
Brian McMahon (IUCr CODATA Representative)
John Westbrook (wwPDB representative and COMCIFS)
Sol Gruner (Diffuse scattering specialist and SR Facility Director)
Heinz-Josef Weyer (SR and Neutron Facility user)
Tom Terwilliger (Macromolecular Crystallography)


Consultants:
Alun Ashton (Diamond Light Source (DLS); Data Archive leader there)
Herbert Bernstein (Head of the imgCIF Dictionary Maintenance Group and
member of COMCIFS)
Frances Bernstein (Observer on data deposition policies)
Gerard Bricogne (Active software and methods developer)
Bernhard Rupp ( Macromolecular crystallographer)

IUCr Commissions (Chairs and/or alternates).



On Sat, Oct 15, 2011 at 1:32 AM, Gerard Bricogne g...@globalphasing.com wrote:
 Dear Tom,

     I am not sure that I feel happy with your invitation that views on such
 crucial matters as these deposition issues be communicated to you off-list.
 It would seem much healthier if these views were aired out within the BB.
 Again!, some will say ... but the difference is that there is now a forum
 for them, set up by the IUCr, that may eventually turn opinions into some
 form of action.

     I am sure that many subscribers to this BB, and not just you as a
 member of some committees, would be interested to hear the full variety of
 views on the desirable and the feasible in these areas, and to express their
 own for everyone to read and discuss.

     Perhaps John Helliwell can elaborate on this and on the newly created
 forum.


     With best wishes,

          Gerard.

 --
 On Fri, Oct 14, 2011 at 04:56:20PM -0600, Thomas C. Terwilliger wrote:
 For those who have strong opinions on what data should be deposited...

 The IUCr is just starting a serious discussion of this subject. Two
 committees, the Data Deposition Working Group, led by John Helliwell,
 and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su)
 are working on this.

 Two key issues are (1) feasibility and importance of deposition of raw
 images and (2) deposition of sufficient information to fully reproduce the
 crystallographic analysis.

 I am on both committees and would be happy to hear your ideas (off-list).
 I am sure the other members of the committees would welcome your thoughts
 as well.

 -Tom T

 Tom Terwilliger
 terwilli...@lanl.gov


  This is a follow up (or a digression) to James comparing test set to
  missing reflections.  I also heard this issue mentioned before but was
  always too lazy to actually pursue it.
 
  So.
 
  The role of the test set is to prevent overfitting.  Let's say I have
  the final model and I monitored the Rfree every step of the way and can
  conclude that there is no overfitting.  Should I do the final refinement
  against complete dataset?
 
  IMCO, I absolutely should.  The test set reflections contain
  information, and the final model is actually biased towards the
  working set.  Refining using all the data can only improve the accuracy
  of the model, if only slightly.
 
  The second question is practical.  Let's say I want to deposit the
  results of the refinement against the full 

Re: [ccp4bb] should the final model be refined against full datset

2011-10-17 Thread Thomas C. Terwilliger
I think that we are using the test set for many things:

1. Determining and communicating to others whether our overall procedure
is overfitting the data.

2. Identifying the optimal overall procedure in cases where very different
options are being considered (e.g., should I use TLS).

3. Calculating specific parameters (e.g. sigmaA).

4. Identifying the best set of overall parameters.

I would suggest that we should generally restrict our usage of the test
set to purposes #1-3.  Given a particular overall procedure for
refinement, a very good set of parameters should be obtainable from the
working set of data.

In particular, approaches in which many parameters (in the limit... all
parameters) are fit to minimize Rfree do not seem likely to produce the
best model overall.  It might be worth doing some experiments with the
super-free set approach to determine whether this is true.


 Hi,

 On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski
 epozh...@umaryland.edu wrote:

 On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
For structures with a small number of reflections, the
  statistical
noise in the 5% sets can be very significant indeed. We have seen
differences between Rfree values obtained from different sets
  reaching
up to 4%.


 this is in line with my observations too.
 Not surprising at all, though (see my previous post on this subject): a
 small seemingly insignificant change somewhere may result in refinement
 taking a different pathway leading to a different local minimum. There is
 even a way of making practical use of this (Rice, Shamoo & Brunger, 1998;
 Korostelev, Laurberg & Noller, 2009; ...).

 This seemingly insignificant change somewhere may be:
 - what Ed mentioned (different noise level in free reflections or simply
 different strength of reflections in free set between sets);
 - slightly different starting conditions (starting parameter values);
 - random seed used in Xray/restraints target weight calculation (applies
 to
 phenix.refine),
 - I can go on for 10+ possibilities.

 I do not know whether choosing the result with the lowest Rfree is a good
 idea or not (after reading Ed's post I am slightly puzzled now), but
 what's
 definitely a good idea in my opinion is to know the range of possible
 R-factor values in your specific case, so you know which difference
 between
 two R-factors obtained in two refinement runs is significant and which one
 is not.

 Pavel



Re: [ccp4bb] should the final model be refined against full datset

2011-10-17 Thread Pavel Afonine
Yes, Rsleep seems to be just the right thing to use for this:

Separating model optimization and model validation in statistical
cross-validation as applied to crystallography
G. J. Kleywegt
Acta Cryst. (2007). D63, 939-940

Practically, it would mean splitting the 10% of test reflections into 5%
used for optimizations like #1-4 and another 5% (the sleep set) that is
never used for anything. The big question is whether this will make any
important difference. I suspect, as with many similar things, there will be
no clear-cut answer (that is, it may or may not make a difference, case by
case).
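A minimal sketch of such a work/test/sleep split (the names and fractions here are illustrative choices, not taken from Kleywegt's paper):

```python
import random

def split_reflections(n_refl, test_frac=0.05, sleep_frac=0.05, seed=42):
    # Shuffle reflection indices once, then carve off the test set
    # (used for decisions like #1-4 above) and the sleep set (locked
    # away until a final, one-shot validation); the rest is the
    # working set used in refinement.
    rng = random.Random(seed)
    order = list(range(n_refl))
    rng.shuffle(order)
    n_test = round(test_frac * n_refl)
    n_sleep = round(sleep_frac * n_refl)
    test = set(order[:n_test])
    sleep = set(order[n_test:n_test + n_sleep])
    work = set(order[n_test + n_sleep:])
    return work, test, sleep

work, test, sleep = split_reflections(40000)
```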

Pavel

On Mon, Oct 17, 2011 at 8:57 AM, Thomas C. Terwilliger terwilli...@lanl.gov
 wrote:

 I think that we are using the test set for many things:

 1. Determining and communicating to others whether our overall procedure
 is overfitting the data.

 2. Identifying the optimal overall procedure in cases where very different
 options are being considered (e.g., should I use TLS).

 3. Calculating specific parameters (e.g. sigmaA).

 4. Identifying the best set of overall parameters.

 I would suggest that we should generally restrict our usage of the test
 set to purposes #1-3.  Given a particular overall procedure for
 refinement, a very good set of parameters should be obtainable from the
 working set of data.

 In particular, approaches in which many parameters (in the limit... all
 parameters) are fit to minimize Rfree do not seem likely to produce the
 best model overall.  It might be worth doing some experiments with the
 super-free set approach to determine whether this is true.


  Hi,
 
  On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski
  epozh...@umaryland.edu wrote:
 
  On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
 For structures with a small number of reflections, the
   statistical
 noise in the 5% sets can be very significant indeed. We have seen
 differences between Rfree values obtained from different sets
   reaching
 up to 4%.
 
 
  this is in line with my observations too.
  Not surprising at all, though (see my previous post on this subject): a
  small seemingly insignificant change somewhere may result in refinement
  taking a different pathway leading to a different local minimum. There
 is
  even a way of making practical use of this (Rice, Shamoo & Brunger, 1998;
  Korostelev, Laurberg & Noller, 2009; ...).
 
  This seemingly insignificant change somewhere may be:
  - what Ed mentioned (different noise level in free reflections or simply
  different strength of reflections in free set between sets);
  - slightly different starting conditions (starting parameter values);
  - random seed used in Xray/restraints target weight calculation (applies
  to
  phenix.refine),
  - I can go on for 10+ possibilities.
 
  I do not know whether choosing the result with the lowest Rfree is a
 good
  idea or not (after reading Ed's post I am slightly puzzled now), but
  what's
  definitely a good idea in my opinion is to know the range of possible
  R-factor values in your specific case, so you know which difference
  between
  two R-factors obtained in two refinement runs is significant and which
 one
  is not.
 
  Pavel
 



Re: [ccp4bb] should the final model be refined against full datset

2011-10-16 Thread Ed Pozharski
On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
   For structures with a small number of reflections, the
 statistical 
   noise in the 5% sets can be very significant indeed. We have seen 
   differences between Rfree values obtained from different sets
 reaching 
   up to 4%.

This produces a curious paradox.

One possible reason for the variation in Rfree when choosing different
test sets is that, by pure chance, reflections with more/less noise can be
selected.  That automatically means the working set contains
reflections with less/more noise, and therefore the model (presumably)
gets better/worse.  So selecting a test set that results in a lower Rfree
leads to a model that is likely worse?

In fact, an obvious way to improve the Rfree through choice of a better
test set is by biasing it towards stronger reflections in each
resolution shell.
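A deterministic toy example of this effect: if every observed amplitude carries the same absolute error, the R factor over any subset is n*err / sum(Fo), so a test set biased toward strong reflections necessarily reports a lower R even though the model is identical. (The constant-error model is of course an illustrative simplification.)

```python
ERR = 5.0
f_true = [10.0 + i for i in range(1000)]      # amplitudes from 10 to 1009
f_obs = [f + ERR for f in f_true]             # constant absolute noise
f_calc = f_true                               # a "perfect" model

def r_factor(indices):
    # Conventional R = sum |Fo - Fc| / sum Fo over the given subset.
    num = sum(abs(f_obs[i] - f_calc[i]) for i in indices)
    den = sum(f_obs[i] for i in indices)
    return num / den

r_strong = r_factor(range(900, 1000))   # test set of the strongest 10%
r_random = r_factor(range(0, 1000, 10)) # an evenly spread 10% test set
r_weak = r_factor(range(0, 100))        # test set of the weakest 10%
```

Here r_strong < r_random < r_weak, purely because of which reflections were flagged.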

Selecting a test set that minimizes Rfree is so wrong on so many levels.
Unless, of course, the only thing I know about Rfree is that it is the
magic number that I need to make small by all means necessary.

Cheers,

Ed.


-- 
Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs


Re: [ccp4bb] should the final model be refined against full datset

2011-10-16 Thread Pavel Afonine
Hi,

On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski epozh...@umaryland.edu wrote:

 On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
For structures with a small number of reflections, the
  statistical
noise in the 5% sets can be very significant indeed. We have seen
differences between Rfree values obtained from different sets
  reaching
up to 4%.


this is in line with my observations too.
Not surprising at all, though (see my previous post on this subject): a
small seemingly insignificant change somewhere may result in refinement
taking a different pathway leading to a different local minimum. There is
even a way of making practical use of this (Rice, Shamoo & Brunger, 1998;
Korostelev, Laurberg & Noller, 2009; ...).

This seemingly insignificant change somewhere may be:
- what Ed mentioned (different noise level in free reflections or simply
different strength of reflections in free set between sets);
- slightly different starting conditions (starting parameter values);
- random seed used in Xray/restraints target weight calculation (applies to
phenix.refine),
- I can go on for 10+ possibilities.

I do not know whether choosing the result with the lowest Rfree is a good
idea or not (after reading Ed's post I am slightly puzzled now), but what's
definitely a good idea in my opinion is to know the range of possible
R-factor values in your specific case, so you know which difference between
two R-factors obtained in two refinement runs is significant and which one
is not.

Pavel


Re: [ccp4bb] should the final model be refined against full datset

2011-10-15 Thread Nicholas M Glykos
Dear Ethan, List,

 Surely someone must have done this!  But I can't recall ever reading
 an analysis of such a refinement protocol.  
 Does anyone know of relevant reports in the literature?

Total statistical cross-validation is indeed what we should be doing, but 
for large structures the computational cost may be significant. In the 
absence of total cross-validation, the reported Rfree may be an 
'outlier' (with respect to the distribution of the Rfree values that would 
have been obtained from all the disjoint sets). To tackle this, we usually 
resort to the following ad hoc procedure:

 At an early stage of the positional refinement, we use a shell script 
which (a) uses Phil's PDBSET with the NOISE keyword to randomly shift 
atomic positions, (b) refines the resulting models to completion against 
each of the different free sets, (c) calculates the mean of the resulting 
free R values, and (d) selects (once and for all) the free set whose Rfree 
is closest to the mean obtained above.
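Step (d) of this recipe amounts to the following; the Rfree values below are hypothetical placeholders for the outputs of step (c):

```python
import statistics

def pick_representative_set(rfree_values):
    # Return the index of the free set whose Rfree is closest to the
    # mean over all candidate sets, so that the reported value is not
    # an outlier of the distribution.
    mean = statistics.fmean(rfree_values)
    return min(range(len(rfree_values)),
               key=lambda i: abs(rfree_values[i] - mean))

# hypothetical Rfree values from refinements against different free sets
rfrees = [0.231, 0.248, 0.262, 0.239, 0.244, 0.251, 0.268, 0.235,
          0.246, 0.242, 0.256, 0.249, 0.237, 0.253, 0.241, 0.259,
          0.233, 0.247, 0.245, 0.250]
chosen = pick_representative_set(rfrees)
```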

For structures with a small number of reflections, the statistical noise 
in the 5% sets can be very significant indeed. We have seen differences 
between Rfree values obtained from different sets reaching up to 4%. 

Ideally, instead of PDBSET+REFMAC we should have been using simulated 
annealing (without positional refinement), but moving back and forth 
between CNS/X-PLOR and CCP4 was too much for my laziness.

All the best,
Nicholas


-- 


  Dr Nicholas M. Glykos, Department of Molecular Biology
 and Genetics, Democritus University of Thrace, University Campus,
  Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/


Re: [ccp4bb] should the final model be refined against full datset

2011-10-15 Thread Anastassis Perrakis
 
 
 For structures with a small number of reflections, the statistical noise 
 in the 5% sets can be very significant indeed. We have seen differences 
 between Rfree values obtained from different sets reaching up to 4%. 

This is very intriguing indeed!
Is there something specific in these structures that makes the Rfree
differences between sets reach 4%? NCS? Or the 5% set having fewer than
~1000-1500 reflections?

It would be indeed very interesting if there was a correlation there!

A.

 
 Ideally, and instead of PDBSET+REFMAC we should have been using simulated 
 annealing (without positional refinement), but moving continuously between 
 the CNS-XPLOR and CCP4 was too much for my laziness.
 
 All the best,
 Nicholas
 
 
 -- 
 
 
  Dr Nicholas M. Glykos, Department of Molecular Biology
 and Genetics, Democritus University of Thrace, University Campus,
  Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/


Re: [ccp4bb] should the final model be refined against full datset

2011-10-15 Thread Nicholas M Glykos
  For structures with a small number of reflections, the statistical 
  noise in the 5% sets can be very significant indeed. We have seen 
  differences between Rfree values obtained from different sets reaching 
  up to 4%.
 
 This is very intriguing indeed! Is there something specific in these 
 structures that makes the Rfree differences between sets reach 4%? 
 NCS? Or the 5% set having less than ~1000-1500 reflections?

Tassos, by your standards, these structures should have been described as 
'tiny' and not small ... ;-)   [Yes, significantly less than 1000. In one 
case the _total_ number of reflections was 5132 (which were, 
nevertheless, slowly and meticulously measured by a CAD4 one-by-one. These 
were the days ... :-)) ].




-- 


  Dr Nicholas M. Glykos, Department of Molecular Biology
 and Genetics, Democritus University of Thrace, University Campus,
  Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/


[ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Ed Pozharski
This is a follow-up (or a digression) to James's comparison of the test
set to missing reflections.  I had also heard this issue mentioned before
but was always too lazy to actually pursue it.

So.

The role of the test set is to prevent overfitting.  Let's say I have
the final model, I monitored Rfree every step of the way, and I can
conclude that there is no overfitting.  Should I do the final refinement
against the complete dataset?

IMCO, I absolutely should.  The test set reflections contain
information, and the final model is actually biased towards the
working set.  Refining using all the data can only improve the accuracy
of the model, if only slightly.

The second question is practical.  Let's say I want to deposit the
results of the refinement against the full dataset as my final model.
Should I omit the Rfree and instead insert a remark explaining the
situation?  If I report the Rfree from before the test set was merged
back in, it is certain that every validation tool will report a mismatch.
The PDB does not seem to have a mechanism to deal with this.

Cheers,

Ed.



-- 
Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Nat Echols
On Fri, Oct 14, 2011 at 12:52 PM, Ed Pozharski epozh...@umaryland.edu wrote:

 The second question is practical.  Let's say I want to deposit the
 results of the refinement against the full dataset as my final model.
 Should I not report the Rfree and instead insert a remark explaining the
 situation?  If I report the Rfree prior to the test set removal, it is
 certain that every validation tool will report a mismatch.  It does not
 seem that the PDB has a mechanism to deal with this.


You should enter the statistics for the model and data that you actually
deposit, not statistics for some other model that you might have had at one
point but which the PDB will never see.  Not only does refining against the
R-free reflections make it impossible to verify and validate your
structure, it also means that any time you or anyone else wants to solve an
isomorphous structure by MR using your structure as a search model, or to
continue the refinement with higher-resolution data, you will be starting
with a model that has been refined against all reflections.  Any future
refinements done with that model against isomorphous data are therefore
pre-biased, making your model potentially useless.

I'm amazed that anyone is still depositing structures refined against all
data, but the PDB does still get a few.  The benefit of including those
extra 5% of data is minimal in every paper I've seen that reports such a
procedure, and is far outweighed by having a reliable and relatively
unbiased validation statistic preserved in the final deposition.
 (The situation may be different for very low resolution data, but those
structures are a tiny fraction of the PDB.)

-Nat


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Robbie Joosten

Hi Ed,
 

 This is a follow up (or a digression) to James comparing test set to
 missing reflections. I also heard this issue mentioned before but was
 always too lazy to actually pursue it.
 
 So.
 
 The role of the test set is to prevent overfitting. Let's say I have
 the final model and I monitored the Rfree every step of the way and can
 conclude that there is no overfitting. Should I do the final refinement
 against complete dataset?
 
 IMCO, I absolutely should. The test set reflections contain
 information, and the final model is actually biased towards the
 working set. Refining using all the data can only improve the accuracy
 of the model, if only slightly.
Hmm, if your R-free set is small, the added value will also be small. If it 
is relatively big, then your previously established optimal weights may no 
longer be optimal. A more elegant approach would be to refine the model 
with, say, 20 different 5% R-free sets, deposit the ensemble, and report 
the average R(-free) plus a standard deviation. AFAIK, this is what the 
R-free set numbers that CCP4's FREERFLAG generates are for. Of course, in 
that case you should do enough refinement (and perhaps rebuilding) to make 
sure each R-free set is truly free.
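Sketched with made-up numbers, such a reporting scheme might look like this; the helper names and the two-sigma criterion are illustrative additions, not an established convention:

```python
import statistics

def summarize_rfree(rfrees):
    # Report Rfree over an ensemble of free-set choices as mean +/- sd.
    return statistics.fmean(rfrees), statistics.stdev(rfrees)

def significantly_different(r1, r2, sd, n_sigma=2.0):
    # Two refinement runs differ meaningfully only if their R-factors
    # are separated by more than the spread across free-set choices.
    return abs(r1 - r2) > n_sigma * sd

mean_rf, sd_rf = summarize_rfree([0.241, 0.236, 0.249, 0.244, 0.238,
                                  0.252, 0.243, 0.239, 0.247, 0.245])
```

Knowing sd_rf then tells you which differences between two refinement runs are significant and which are within the noise of the free-set choice.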

 The second question is practical. Let's say I want to deposit the
 results of the refinement against the full dataset as my final model.
 Should I not report the Rfree and instead insert a remark explaining the
 situation? If I report the Rfree prior to the test set removal, it is
 certain that every validation tool will report a mismatch. It does not
 seem that the PDB has a mechanism to deal with this.
The deposited R-free sets in the PDB are quite frequently 'unfree' or the wrong 
set was deposited (checking this is one of the recommendations in the VTF 
report in Structure). So at the moment you would probably get away with 
depositing an unfree R-free set ;)
 
Cheers,
Robbie
 
 
 
 Cheers,
 
 Ed.
 
 
 
 -- 
 Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
 Julian, King of Lemurs
  

Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Quyen Hoang
Sorry, I don't quite understand your reasoning for how the structure
is rendered useless if one refines it with all data.
Would your argument also apply to all the structures that were refined
before R-free existed?


Quyen



You should enter the statistics for the model and data that you  
actually deposit, not statistics for some other model that you might  
have had at one point but which the PDB will never see.  Not only  
does refining against R-free make it impossible to verify and  
validate your structure, it also means that any time you or anyone  
else wants to solve an isomorphous structure by MR using your  
structure as a search model, or continue the refinement with higher- 
resolution data, you will be starting with a model that has been  
refined against all reflections.  So any future refinements done  
with that model against isomorphous data are pre-biased, making your  
model potentially useless.


I'm amazed that anyone is still depositing structures refined  
against all data, but the PDB does still get a few.  The benefit of  
including those extra 5% of data is always minimal in every paper  
I've seen that reports such a procedure, and far outweighed by  
having a reliable and relatively unbiased validation statistic that  
is preserved in the final deposition.  (The situation may be  
different for very low resolution data, but those structures are a  
tiny fraction of the PDB.)


-Nat


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Craig A. Bingman
Recent experience indicates that the PDB is checking these statistics very 
closely for new depositions.  The checks made by the PDB are intended to 
prevent accidents and oversights made by honest people from creeping into the 
database.  Getting away with something seems to imply some intention to 
deceive, and that is much more difficult to detect.

On Oct 14, 2011, at 3:09 PM, Robbie Joosten wrote:

 The deposited R-free sets in the PDB are quite frequently 'unfree' or the 
 wrong set was deposited (checking this is one of the recommendations in the 
 VTF report in Structure). So at the moment you would probably get away with 
 depositing an unfree R-free set ;)
 



Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Jan Dohnalek
Regarding refinement against all reflections: the main goal of our work is
to provide the best possible representation of the experimental data in the
form of the structure model. Once the structure building and refinement
process is finished, keeping the Rfree set separate no longer makes sense.
Its role ends once the last set of changes has been made to the model and
verified ...

J. Dohnalek


On Fri, Oct 14, 2011 at 10:23 PM, Craig A. Bingman 
cbing...@biochem.wisc.edu wrote:

 Recent experience indicates that the PDB is checking these statistics very
 closely for new depositions.  The checks made by the PDB are intended to
 prevent accidents and oversights made by honest people from creeping into
 the database.  Getting away with something seems to imply some intention
 to deceive, and that is much more difficult to detect.

 On Oct 14, 2011, at 3:09 PM, Robbie Joosten wrote:

 The deposited R-free sets in the PDB are quite frequently 'unfree' or the
 wrong set was deposited (checking this is one of the recommendations in the
 VTF report in Structure). So at the moment you would probably get away with
 depositing an unfree R-free set ;)





-- 
Jan Dohnalek, Ph.D
Institute of Macromolecular Chemistry
Academy of Sciences of the Czech Republic
Heyrovskeho nam. 2
16206 Praha 6
Czech Republic

Tel: +420 296 809 390
Fax: +420 296 809 410


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Nat Echols
On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang qqho...@gmail.com wrote:

 Sorry, I don't quite understand your reasoning for how the structure is
 rendered useless if one refined it with all data.


'Useless' was too strong a word (it's Friday, sorry).  I guess simulated
annealing can address the model-bias issue, but I'm not totally convinced
that it solves the problem.  And not every crystallographer will run SA
every time he/she solves an isomorphous structure, so there's a real danger
of misleading future users of the PDB file.  The reported R-free, of course,
is still meaningless in the context of the deposited model.

Would your argument also apply to all the structures that were refined
 before R-free existed?


Technically, yes - but how many proteins are there whose only
representatives in the PDB were refined this way?  I suspect very few; in
most cases, a more recent model should be available.

-Nat


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Quyen Hoang
I still don't understand how a structure model refined with all data  
would negatively affect the determination and/or refinement of an  
isomorphous structure using a different data set (even without doing  
SA first).


Quyen





Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Phil Jeffrey
Let's say you have two isomorphous crystals of two different 
protein-ligand complexes.  Same protein different ligand, same xtal 
form.  Conventionally you'd keep the same free set reflections (hkl 
values) between the two datasets to reduce biasing.  However if the 
first model had been refined against all reflections there is no longer 
a free set for that model, thus all hkl's have seen the atoms during 
refinement, and so your R-free in the second complex is initially biased 
to the model from the first complex. [*]


The tendency is to do less refinement in these sorts of isomorphous cases 
than in molecular replacement solutions, because the structural changes 
are usually far less (it is isomorphous after all) so there's a risk 
that the R-free will not be allowed to fully float free of that initial 
bias.  That makes your R-free look better than it actually is.


This is rather strongly analogous to using different free sets in the 
two datasets.


However I'm not sure that this is as big of a deal as it is being made 
to sound.  It can be dealt with straightforwardly.  However refining 
against all the data weakens the use of R-free as a validation tool for 
that particular model so the people that like to judge structures based 
on a single number (i.e. R-free) are going to be quite put out.


It's also the case that the best model probably *is* the one based on a 
careful last round of refinement against all data, as long as nothing 
much changes.  That would need to be quantified in some way(s).


Phil Jeffrey
Princeton

[* Your R-free is also initially model-biased in cases where the data 
are significantly non-isomorphous or you're using two different xtal 
forms, to varying extents]









Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Felix Frolow
Recently we (I mean WE - community) frequently refine structures at around 1 
Angstrom resolution.
This is not what Rfree was invented for. It was invented to get away with 
3.0-2.8 Angstrom data,
in times when people did not possess facilities good enough to look at the 
electron density maps….
We finish (WE - I again mean - community) the refinement of our structures too 
early.

Dr Felix Frolow   
Professor of Structural Biology and Biotechnology
Department of Molecular Microbiology
and Biotechnology
Tel Aviv University 69978, Israel

Acta Crystallographica F, co-editor

e-mail: mbfro...@post.tau.ac.il
Tel:  ++972-3640-8723
Fax: ++972-3640-9407
Cellular: 0547 459 608




Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Ed Pozharski
On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote:

 You should enter the statistics for the model and data that you
 actually deposit, not statistics for some other model that you might
 have had at one point but which the PDB will never see.  

If you read my post carefully, you'll see that I never suggested
reporting statistics for one model and depositing the other

 Not only does refining against R-free make it impossible to verify and
 validate your structure, it also means that any time you or anyone
 else wants to solve an isomorphous structure by MR using your
 structure as a search model, or continue the refinement with
 higher-resolution data, you will be starting with a model that has
 been refined against all reflections.  So any future refinements done
 with that model against isomorphous data are pre-biased, making your
 model potentially useless.

Frankly, I think you are exaggerating the magnitude of model bias in the
situation that I described.  You assume that the refinement will become
severely unstable after tossing in the test reflections.  Depending on
the resolution etc., the rms shift of the model may vary, but even if it
is, say, half an angstrom (and that is hugely overestimated), the model
hardly becomes useless.  And at least in theory including *all the data*
should make the model more, not less, accurate.

 The benefit of including those extra 5% of data is always minimal 

And so is probably the benefit of excluding when all the steps that
require cross-validation have already been performed.  My thinking is
that excluding data from analysis should always be justified (and in the
initial stages of refinement, it might be as it prevents overfitting),
not the other way around.

Cheers,

Ed.

-- 
Hurry up before we all come back to our senses!
   Julian, King of Lemurs


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Craig A. Bingman
We have obligations that extend beyond simply presenting a best model.  

In an ideal world, the PDB would accept two coordinate sets and two sets of 
statistics, one for the last step where the cross-validation set was valid, and 
a final model refined against all the data.  Until there is a clear way to do 
that, and an unambiguous presentation of them to the public, IMO, the gains won 
by refinement against all the data are outweighed by the confusion that it can 
cause when presenting the model and associated statistics to the public.


On Oct 14, 2011, at 3:32 PM, Jan Dohnalek wrote:

 Regarding refinement against all reflections: the main goal of our work is to 
 provide the best possible representation of the experimental data in the form 
 of the structure model. Once the structure building and refinement process is 
 finished keeping the Rfree set separate does not make sense any more. Its 
 role finishes once the last set of changes have been done to the model and 
 verified ...
 
 J. Dohnalek


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Quyen Hoang

Thanks for the clear explanation. I understood that.
But I was trying to understand how this would negatively affect the  
initial model to render it useless or less useful.
In the scenario that you presented, I would expect a better result  
(better model) if the initial model was refined with all data, thus  
more useful.
Sure, again in your scenario, the new structure has seen R-free  
reflections at the equivalent indices of its replacement model, but  
their intensities should be different anyway, so I am not sure how  
this is bad. Even if the bias is huge, let's say this bias results in  
a 1% reduction in initial R-free (exaggerating here), how would this  
make one's model bad, or how would this be bad for one's science?
In the end, our objective is to build the best model possible, and I  
think that more data would likely result in a better model, not the  
other way around. If we can agree that refining a model with all data  
would result in a better model, then wouldn't not doing so constitute  
a compromise of model quality for a purer statistic?


I had not refined a model with all data before (just to keep in line),  
but I wondered if I was doing the best thing.


Cheers,
Quyen






Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Ethan Merritt
On Friday, October 14, 2011 02:45:08 pm Ed Pozharski wrote:
 On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote:
  
  The benefit of including those extra 5% of data is always minimal 
 
 And so is probably the benefit of excluding when all the steps that
 require cross-validation have already been performed.  My thinking is
 that excluding data from analysis should always be justified (and in the
 initial stages of refinement, it might be as it prevents overfitting),
 not the other way around.

A model with error bars is more useful than a marginally more
accurate model without error bars, not least because you are probably
taking it on faith that the second model is more accurate.

Crystallographers were kind of late in realizing that a cross validation
test could be useful in assessing refinement.  What's more, we 
never really learned the whole lesson.  Rather than using the full
test, we use only one blade of the jackknife.  

http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation

The full test would involve running multiple parallel refinements, 
each one omitting a different disjoint set of reflections.  
The ccp4 suite is set up to do this,  since Rfree flags by default run 
from 0-19 and refmac lets you specify which 5% subset is to be omitted
from the current run. Of course, evaluating the end point becomes more
complex than looking at a single number Rfree.

Surely someone must have done this!  But I can't recall ever reading
an analysis of such a refinement protocol.  
Does anyone know of relevant reports in the literature?

Is there a program or script that will collect K-fold parallel output
models and their residuals to generate a net indicator of model quality?

Ethan

-- 
Ethan A Merritt
Biomolecular Structure Center,  K-428 Health Sciences Bldg
University of Washington, Seattle 98195-7742


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Phil Evans
I just tried refining a finished structure turning off the FreeR set, in 
Refmac, and I have to say I can barely see any difference between the two sets 
of coordinates.

From this n=1 trial, I can't see that it improves the model significantly, nor 
that it ruins the model irretrievably for future purposes.   

I suspect we worry too much about these things

Phil Evans



Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Thomas C. Terwilliger
For those who have strong opinions on what data should be deposited...

The IUCR is just starting a serious discussion of this subject. Two
committees, the Data Deposition Working Group, led by John Helliwell,
and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su)
are working on this.

Two key issues are (1) feasibility and importance of deposition of raw
images and (2) deposition of sufficient information to fully reproduce the
crystallographic analysis.

I am on both committees and would be happy to hear your ideas (off-list). 
I am sure the other members of the committees would welcome your thoughts
as well.

-Tom T

Tom Terwilliger
terwilli...@lanl.gov


 This is a follow up (or a digression) to James comparing test set to
 missing reflections.  I also heard this issue mentioned before but was
 always too lazy to actually pursue it.

 So.

 The role of the test set is to prevent overfitting.  Let's say I have
 the final model and I monitored the Rfree every step of the way and can
 conclude that there is no overfitting.  Should I do the final refinement
 against complete dataset?

 IMCO, I absolutely should.  The test set reflections contain
 information, and the final model is actually biased towards the
 working set.  Refining using all the data can only improve the accuracy
 of the model, if only slightly.

 The second question is practical.  Let's say I want to deposit the
 results of the refinement against the full dataset as my final model.
 Should I not report the Rfree and instead insert a remark explaining the
 situation?  If I report the Rfree prior to the test set removal, it is
 certain that every validation tool will report a mismatch.  It does not
 seem that the PDB has a mechanism to deal with this.

 Cheers,

 Ed.



 --
 Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
 Julian, King of Lemurs



Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Gerard Bricogne
Dear Tom,

 I am not sure that I feel happy with your invitation that views on such
crucial matters as these deposition issues be communicated to you off-list.
It would seem much healthier if these views were aired out within the BB. 
Again!, some will say ... but the difference is that there is now a forum
for them, set up by the IUCr, that may eventually turn opinions into some
form of action.

 I am sure that many subscribers to this BB, and not just you as a
member of some committees, would be interested to hear the full variety of
views on the desirable and the feasible in these areas, and to express their
own for everyone to read and discuss.

 Perhaps John Helliwell can elaborate on this and on the newly created
forum.


 With best wishes,
 
  Gerard.


-- 

 ===
 * *
 * Gerard Bricogne g...@globalphasing.com  *
 * *
 * Global Phasing Ltd. *
 * Sheraton House, Castle Park Tel: +44-(0)1223-353033 *
 * Cambridge CB3 0AX, UK   Fax: +44-(0)1223-366889 *
 * *
 ===


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread D Bonsor
I may be missing something (if so, I would be curious to hear why I am wrong), 
but with a highly redundant dataset, wouldn't the difference from refining the 
final model against the full dataset be small, given the random selection of 
reflections for Rfree? 


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Thomas C. Terwilliger
Dear Gerard,

I'm very happy for the discussion to be on the CCP4 list (or on the IUCR
forums, or both).  I was only trying to not create too much traffic.

All the best,
Tom T




Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Edward A. Berry

Now it would be interesting to refine this structure to convergence,
with the original free set. If I understood correctly Ian Tickle has
done essentially this, and the Free R returns essentially to its
original value: the minimum arrived at is independent of starting
point, perhaps within the limitation that one might get caught in a
different false minimum (which is unlikely given the minuscule changes
you see). If that is the case we should stop worrying about
corrupting the free set by refining against it or even using it
to make maps in which models will be adjusted.
This is a perennial discussion but I never saw the report that
in fact original free-R is _not_ recoverable by refining to
convergence.

Phil Evans wrote:

I just tried refining a finished structure turning off the FreeR set, in 
Refmac, and I have to say I can barely see any difference between the two sets of 
coordinates.

 From this n=1 trial, I can't see that it improves the model significantly, nor 
that it ruins the model irretrievably for future purposes.

I suspect we worry too much about these things

Phil Evans



Indeed, perhaps we worry too much about such things.







Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread James Stroud
Each R-free flag corresponds to a particular HKL index. Redundancy refers to the 
number of times a reflection corresponding to a given HKL index is observed. 
The final structure factor of a given HKL can be thought of as an average of 
these redundant observations.

Related to your question, someone once mentioned that for each particular space 
group, there should be a preferred R-free assignment. As far as I know, nothing 
tangible ever came of that idea.
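[One tangible version of that idea would be to derive the flag deterministically from the Miller index itself, so isomorphous datasets automatically inherit the same test set. A hypothetical sketch, not the convention of any existing program; a real implementation would first map each HKL to its asymmetric-unit representative so symmetry equivalents share a flag:]

```python
# Hypothetical sketch: a CCP4-style free-R flag (0-19) computed
# deterministically from the Miller index, so the same HKL always
# falls in the same subset in every isomorphous dataset.

import hashlib

def free_flag(h, k, l, nsets=20):
    digest = hashlib.sha256(f"{h},{k},{l}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % nsets

# Same index, same flag -- in any dataset:
assert free_flag(1, 2, 3) == free_flag(1, 2, 3)

# The hash spreads flags roughly evenly over the nsets subsets:
flags = [free_flag(h, k, l)
         for h in range(10) for k in range(10) for l in range(10)]
print(len(flags), "indices flagged into", len(set(flags)), "subsets")
```

[A per-space-group "preferred" assignment could then be encoded simply by hashing the asymmetric-unit representative rather than the raw index.]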

James






Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Ed Pozharski
On Fri, 2011-10-14 at 23:41 +0100, Phil Evans wrote:
 I just tried refining a finished structure turning off the FreeR
 set, in Refmac, and I have to say I can barely see any difference
 between the two sets of coordinates.

The amplitude of the shift, I presume, depends on the resolution and
data quality.  With a very good 1.2A dataset refined with anisotropic
B-factors to R~14% what I see is ~0.005A rms shift.  Which is not much,
however the reported ML DPI is ~0.02A, so perhaps the effect is not that
small compared to the precision of the model.  

On the other hand, the more normal example at 1.7A (and very good data
refining down to R~15%) shows ~0.03A general variation with a variable
test set.  Again, not much, but the ML DPI in this case is ~0.06A -
comparable to the variation induced by the choice of the test set.
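[For anyone wanting to repeat this comparison, the rms shift between two refined models is easy to compute once the atoms are in matching order; a minimal numpy sketch with placeholder coordinates:]

```python
# Sketch: rms coordinate shift between two versions of one model,
# assuming identical atom ordering (no superposition is performed).
import numpy as np

def rms_shift(xyz_a, xyz_b):
    d = np.asarray(xyz_a) - np.asarray(xyz_b)
    return float(np.sqrt((d ** 2).sum(axis=1).mean()))

a = np.zeros((3, 3))                  # placeholder coordinates
b = a + [0.003, 0.004, 0.0]           # uniform 0.005 A displacement
print(f"rms shift = {rms_shift(a, b):.4f} A")   # -> rms shift = 0.0050 A
```

[Comparing that number against the ML DPI, as above, is what tells you whether the shift is within the model's own precision.]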

Cheers,

Ed.

-- 
Hurry up, before we all come back to our senses!
  Julian, King of Lemurs


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Pavel Afonine
Hi,

yes, shifts depend on resolution indeed. See pages 75-77 here:

http://www.phenix-online.org/presentations/latest/pavel_refinement_general.pdf

Pavel
