Re: [ccp4bb] should the final model be refined against full dataset

2011-10-18 Thread Ed Pozharski
> Selecting a test set that minimizes Rfree is so wrong on so many levels.
> Unless, of course, the only thing I know about Rfree is that it is the
> magic number that I need to make small by all means necessary.

By using a simple genetic algorithm, I managed to get Rfree for a
well-refined model as low as 14.6% and as high as 19.1%.  The dataset is
not too small (~40,000 reflections in all, with a standard-sized 5% test
set).  So you can get a spread as wide as 4.5% even with a not-so-small
dataset.  Only ~1/3 of the test reflections need to be exchanged to
achieve this.

What's curious is that, contrary to my expectations, the test set
remains well distributed across resolution shells after this awful
"optimization", and the averages for the working set and test set remain
close.  I am not sure how to judge which model is actually better, but it
is noteworthy that the FOM gets worse for *both* upward and downward
"optimization" of the test set.
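The swap search can be sketched as follows. This is a toy illustration, not the actual script from the post: refinement is replaced by a fixed set of synthetic per-reflection residuals (all names and numbers are made up), so it only shows how exchanging test reflections moves the apparent Rfree while the model never changes.

```python
import random

random.seed(0)

def make_reflections(n, lo=10.0, hi=100.0, sigma=5.0):
    """Synthetic (F_obs, F_calc) pairs; F_calc = F_obs + noise stands in
    for an already-refined, fixed model."""
    out = []
    for _ in range(n):
        f = random.uniform(lo, hi)
        out.append((f, f + random.gauss(0.0, sigma)))
    return out

refl = make_reflections(2000)

def r_factor(indices):
    """R = sum|Fo - Fc| / sum Fo over the chosen reflections."""
    num = sum(abs(refl[i][0] - refl[i][1]) for i in indices)
    den = sum(refl[i][0] for i in indices)
    return num / den

def rel_residual(i):
    return abs(refl[i][0] - refl[i][1]) / refl[i][0]

# A standard-looking 5% test set
test = set(random.sample(range(len(refl)), 100))
work = set(range(len(refl))) - test
r_start = r_factor(test)

# Greedy "optimization": repeatedly trade the worst-fitting test
# reflection for the best-fitting working reflection.
for _ in range(30):
    worst = max(test, key=rel_residual)
    best = min(work, key=rel_residual)
    test.remove(worst); work.add(worst)
    work.remove(best); test.add(best)

r_end = r_factor(test)
print(r_start, r_end)  # r_end comes out well below r_start
```

A genetic algorithm, as in the post, just explores such swaps more systematically; the point stands either way.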


-- 
After much deep and profound brain things inside my head, 
I have decided to thank you for bringing peace to our home.
Julian, King of Lemurs


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-17 Thread Pavel Afonine
Yes, Rsleep seems to be just the right thing to use for this:

Separating model optimization and model validation in statistical
cross-validation as applied to crystallography
G. J. Kleywegt
Acta Cryst. (2007). D63, 939-940

Practically, it would mean splitting the 10% of test reflections into 5%
used for optimizations like #1-4 and another 5% (the sleep set) that is
never used for anything. The big question is whether this will make any
important difference. I suspect, as with many similar things, there will be
no clear-cut answer (that is, it may or may not make a difference, depending
on the case).

Pavel
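A minimal sketch of such a work/free/sleep split over reflection indices (the fractions, the fixed seed, and the function name are illustrative assumptions, not a prescription):

```python
import random

def split_reflections(n_refl, frac_free=0.05, frac_sleep=0.05, seed=13):
    """Assign each reflection index to 'work', 'free' (used for purposes
    like #1-4 above), or 'sleep' (never used for anything until a final,
    one-time validation)."""
    rng = random.Random(seed)
    idx = list(range(n_refl))
    rng.shuffle(idx)
    n_free = round(n_refl * frac_free)
    n_sleep = round(n_refl * frac_sleep)
    return {
        "free": set(idx[:n_free]),
        "sleep": set(idx[n_free:n_free + n_sleep]),
        "work": set(idx[n_free + n_sleep:]),
    }

sets = split_reflections(40000)
print(len(sets["free"]), len(sets["sleep"]), len(sets["work"]))
# -> 2000 2000 36000
```

The sleep set stays untouched by every refinement decision, which is exactly what makes its R value a fair final check.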

On Mon, Oct 17, 2011 at 8:57 AM, Thomas C. Terwilliger wrote:

> I think that we are using the test set for many things:
>
> 1. Determining and communicating to others whether our overall procedure
> is overfitting the data.
>
> 2. Identifying the optimal overall procedure in cases where very different
> options are being considered (e.g., should I use TLS).
>
> 3. Calculating specific parameters (e.g., sigmaA).
>
> 4. Identifying the "best" set of overall parameters.
>
> I would suggest that we should generally restrict our usage of the test
> set to purposes #1-3.  Given a particular overall procedure for
> refinement, a very good set of parameters should be obtainable from the
> working set of data.
>
> In particular, approaches in which many parameters (in the limit... all
> parameters) are fit to minimize Rfree do not seem likely to produce the
> best model overall.  It might be worth doing some experiments with the
> super-free set approach to determine whether this is true.
>
>
> >> Hi,
> >>
> >> On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski
> >> wrote:
> >>
> >>> On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
> >>> > > > For structures with a small number of reflections, the
> >>> > statistical
> >>> > > > noise in the 5% sets can be very significant indeed. We have seen
> >>> > > > differences between Rfree values obtained from different sets
> >>> > reaching
> >>> > > > up to 4%.
> >>>
> >>
> >> this is in line with my observations too.
> >> Not surprising at all, though (see my previous post on this subject): a
> >> small seemingly insignificant change somewhere may result in refinement
> >> taking a different pathway leading to a different local minimum. There
> is
> >> even a way of making practical use of this (Rice, Shamoo & Brunger, 1998;
> >> Korostelev, Laurberg & Noller, 2009; ...).
> >>
> >> This "seemingly insignificant change somewhere" may be:
> >> - what Ed mentioned (different noise level in free reflections or simply
> >> different strength of reflections in free set between sets);
> >> - slightly different starting conditions (starting parameter value);
> >> - random seed used in Xray/restraints target weight calculation (applies
> >> to
> >> phenix.refine),
> >> - I can go on for 10+ possibilities.
> >>
> >> I do not know whether choosing the result with the lowest Rfree is a
> good
> >> idea or not (after reading Ed's post I am slightly puzzled now), but
> >> what's
> >> definitely a good idea in my opinion is to know the range of possible
> >> R-factor values in your specific case, so you know which difference
> >> between
> >> two R-factors obtained in two refinement runs is significant and which
> one
> >> is not.
> >>
> >> Pavel
> >>
>


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-17 Thread Thomas C. Terwilliger
I think that we are using the test set for many things:

1. Determining and communicating to others whether our overall procedure
is overfitting the data.

2. Identifying the optimal overall procedure in cases where very different
options are being considered (e.g., should I use TLS).

3. Calculating specific parameters (e.g., sigmaA).

4. Identifying the "best" set of overall parameters.

I would suggest that we should generally restrict our usage of the test
set to purposes #1-3.  Given a particular overall procedure for
refinement, a very good set of parameters should be obtainable from the
working set of data.

In particular, approaches in which many parameters (in the limit... all
parameters) are fit to minimize Rfree do not seem likely to produce the
best model overall.  It might be worth doing some experiments with the
super-free set approach to determine whether this is true.


>> Hi,
>>
>> On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski
>> wrote:
>>
>>> On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
>>> > > > For structures with a small number of reflections, the
>>> > statistical
>>> > > > noise in the 5% sets can be very significant indeed. We have seen
>>> > > > differences between Rfree values obtained from different sets
>>> > reaching
>>> > > > up to 4%.
>>>
>>
>> this is in line with my observations too.
>> Not surprising at all, though (see my previous post on this subject): a
>> small seemingly insignificant change somewhere may result in refinement
>> taking a different pathway leading to a different local minimum. There is
>> even a way of making practical use of this (Rice, Shamoo & Brunger, 1998;
>> Korostelev, Laurberg & Noller, 2009; ...).
>>
>> This "seemingly insignificant change somewhere" may be:
>> - what Ed mentioned (different noise level in free reflections or simply
>> different strength of reflections in free set between sets);
>> - slightly different starting conditions (starting parameter value);
>> - random seed used in Xray/restraints target weight calculation (applies
>> to
>> phenix.refine),
>> - I can go on for 10+ possibilities.
>>
>> I do not know whether choosing the result with the lowest Rfree is a good
>> idea or not (after reading Ed's post I am slightly puzzled now), but
>> what's
>> definitely a good idea in my opinion is to know the range of possible
>> R-factor values in your specific case, so you know which difference
>> between
>> two R-factors obtained in two refinement runs is significant and which one
>> is not.
>>
>> Pavel
>>


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-17 Thread John R Helliwell
Dear Gerard,Tom and Bernhard,
Thankyou for highlighting the IUCr Diffraction Data Deposition Working
Group and Forum.


Dear Colleagues,
I am travelling at present and apologise for not replying sooner to
the CCP4bb; I also have only intermittent email access until later
this week, when I 'return to office'.

The points being raised in this CCP4bb thread are very important and
the IUCr also recognises this.

The role of the IUCr Working Group that has been set up is to bring
information into focus and to identify steps forward. We seek to make
progress towards archiving, and making available, all relevant
scientific data associated with a publication (or a completed
structure deposition in a validated database such as the PDB). The
consultation process is being formalised via the IUCr Forum pages. The
Working Group, and a wider group consisting of IUCr Commissions and
consultants, have been established for discussion and planning. We are
also aiming at a community consultation via the Forum, which we will
launch as soon as possible.

The IUCr invites the widest possible input, from the various
communities that the IUCr Commissions serve, on the future of
diffraction data deposition, which can surely be improved. This Forum
will help to record an organised set of inputs for future reference.

The Forum is being set up and will require registration, which is a
straightforward process. Details will follow shortly.

Members of the Working Group and its consulted representatives are listed below.

Best wishes and regards,
Yours sincerely,
John
Prof John R Helliwell DSc
Chairman of the IUCr Diffraction Data Deposition Working Group (IUCr DDD WG).



IUCr DDD WG Members
Steve Androulakis (TARDIS representative)
John R. Helliwell (Chair) (IUCr ICSTI Representative; Chairman of the
IUCr Journals Commission 1996-2005)
Loes Kroon-Batenburg (Data processing software)
Brian McMahon (IUCr CODATA Representative)
John Westbrook (wwPDB representative and COMCIFS)
Sol Gruner (Diffuse scattering specialist and SR Facility Director)
Heinz-Josef Weyer (SR and Neutron Facility user)
Tom Terwilliger (Macromolecular Crystallography)


Consultants:
Alun Ashton (Diamond Light Source (DLS); Data Archive leader there)
Herbert Bernstein (Head of the imgCIF Dictionary Maintenance Group and
member of COMCIFS)
Frances Bernstein (Observer on data deposition policies)
Gerard Bricogne (Active software and methods developer)
Bernhard Rupp (Macromolecular crystallographer)

IUCr Commissions (Chairs and/or alternates).



On Sat, Oct 15, 2011 at 1:32 AM, Gerard Bricogne wrote:
> Dear Tom,
>
>     I am not sure that I feel happy with your invitation that views on such
> crucial matters as these deposition issues be communicated to you off-list.
> It would seem much healthier if these views were aired out within the BB.
> Again!, some will say ... but the difference is that there is now a forum
> for them, set up by the IUCr, that may eventually turn opinions into some
> form of action.
>
>     I am sure that many subscribers to this BB, and not just you as a
> member of some committees, would be interested to hear the full variety of
> views on the desirable and the feasible in these areas, and to express their
> own for everyone to read and discuss.
>
>     Perhaps John Helliwell can elaborate on this and on the newly created
> forum.
>
>
>     With best wishes,
>
>          Gerard.
>
> --
> On Fri, Oct 14, 2011 at 04:56:20PM -0600, Thomas C. Terwilliger wrote:
>> For those who have strong opinions on what data should be deposited...
>>
>> The IUCR is just starting a serious discussion of this subject. Two
>> committees, the "Data Deposition Working Group", led by John Helliwell,
>> and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su)
>> are working on this.
>>
>> Two key issues are (1) feasibility and importance of deposition of raw
>> images and (2) deposition of sufficient information to fully reproduce the
>> crystallographic analysis.
>>
>> I am on both committees and would be happy to hear your ideas (off-list).
>> I am sure the other members of the committees would welcome your thoughts
>> as well.
>>
>> -Tom T
>>
>> Tom Terwilliger
>> terwilli...@lanl.gov
>>
>>
>> >> This is a follow up (or a digression) to James comparing test set to
>> >> missing reflections.  I also heard this issue mentioned before but was
>> >> always too lazy to actually pursue it.
>> >>
>> >> So.
>> >>
>> >> The role of the test set is to prevent overfitting.  Let's say I have
>> >> the final model and I monitored the Rfree every step of the way and can
>> >> conclude that there is no overfitting.  Should I do the final refinement
>> >> against complete dataset?
>> >>
>> >> IMCO, I absolutely should.  The test set reflections contain
>> >> information, and the "final" model is actually biased towards the
>> >> working set.  Refining using all the data can only improve the accuracy
>> >> of the model, if only slightly.
>> >>
>> >> T

Re: [ccp4bb] should the final model be refined against full dataset

2011-10-17 Thread Tim Gruene

Dear Nicholas,

for a data set with 5132 unique reflections you should flag 10.5% for
Rfree; otherwise you might as well drop Rfree completely and use the
whole data set for refinement. At least, this is how I understand Axel
Brunger's article about Rfree, where he states that one needs 500-1000
reflections for Rfree to be statistically meaningful.

I have wondered where the '5% rule' came from, since it compromises Rfree
for low-resolution data sets (especially with high symmetry).

If Axel Brunger's initial statement has become obsolete I would
appreciate some clarification on the required number of flagged
reflections, but until then I will keep flagging 500-1000 reflections
rather than 5%.
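The 500-1000-reflection guideline turns into a simple sizing rule. The helper below is hypothetical (the 10.5% cap just mirrors the figure quoted above; it is not an established standard):

```python
def free_fraction(n_unique, target=500, default=0.05, cap=0.105):
    """Fraction of reflections to flag for Rfree: the usual 5%,
    enlarged whenever 5% would give fewer than `target` test
    reflections, but never more than `cap`."""
    return min(max(default, target / n_unique), cap)

print(free_fraction(5132))   # ~0.0974, i.e. about 500 of 5132 reflections
print(free_fraction(40000))  # 0.05 -- the default already gives 2000
```

For small, high-symmetry data sets the fraction rises well above 5%; for large ones the default is untouched.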

Tim

On 10/15/2011 10:48 AM, Nicholas M Glykos wrote:
>>> For structures with a small number of reflections, the statistical 
>>> noise in the 5% sets can be very significant indeed. We have seen 
>>> differences between Rfree values obtained from different sets reaching 
>>> up to 4%.
>>
>> This is very intriguing indeed! Is there something specific in these 
>> structures that Rfree differences depending on the set used reach 4%? 
>> NCS? Or the 5% set having less than ~1000-1500 reflections?
> 
> Tassos, by your standards, these structures should have been described as 
> 'tiny' and not small ... ;-)   [Yes, significantly less than 1000. In one 
> case the _total_ number of reflections was 5132 reflections (which were, 
> nevertheless, slowly and meticulously measured by a CAD4 one-by-one. These 
> were the days ... :-)) ].
> 
> 
> 
> 

-- 
Dr Tim Gruene
Institut fuer anorganische Chemie
Tammannstr. 4
D-37077 Goettingen

GPG Key ID = A46BEE1A



Re: [ccp4bb] should the final model be refined against full dataset

2011-10-16 Thread Pavel Afonine
Hi,

On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski wrote:

> On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
> > > > For structures with a small number of reflections, the
> > statistical
> > > > noise in the 5% sets can be very significant indeed. We have seen
> > > > differences between Rfree values obtained from different sets
> > reaching
> > > > up to 4%.
>

this is in line with my observations too.
Not surprising at all, though (see my previous post on this subject): a
small seemingly insignificant change somewhere may result in refinement
taking a different pathway leading to a different local minimum. There is
even a way of making practical use of this (Rice, Shamoo & Brunger, 1998;
Korostelev, Laurberg & Noller, 2009; ...).

This "seemingly insignificant change somewhere" may be:
- what Ed mentioned (different noise level in free reflections or simply
different strength of reflections in free set between sets);
- slightly different starting conditions (starting parameter value);
- random seed used in Xray/restraints target weight calculation (applies to
phenix.refine),
- I can go on for 10+ possibilities.

I do not know whether choosing the result with the lowest Rfree is a good
idea or not (after reading Ed's post I am slightly puzzled now), but what's
definitely a good idea in my opinion is to know the range of possible
R-factor values in your specific case, so you know which difference between
two R-factors obtained in two refinement runs is significant and which one
is not.

Pavel


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-16 Thread Ed Pozharski
On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote:
> > > For structures with a small number of reflections, the
> statistical 
> > > noise in the 5% sets can be very significant indeed. We have seen 
> > > differences between Rfree values obtained from different sets
> reaching 
> > > up to 4%.

This produces a curious paradox.

One possible reason for the variation in Rfree when choosing different
test sets is that, by pure chance, reflections with more/less noise can be
selected.  Which automatically means that the working set contains
reflections with less/more noise, and therefore the model (presumably)
gets better/worse.  So, selecting a test set that results in a lower Rfree
leads to a model which is likely worse?

In fact, an obvious way to improve the Rfree through choice of a better
test set is by biasing it towards stronger reflections in each
resolution shell.
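A toy numerical illustration of that bias (synthetic data; constant absolute noise means weak reflections carry larger relative errors, roughly as measurement statistics dictate):

```python
import random

random.seed(1)

def make_reflections(n, lo=5.0, hi=100.0, sigma=3.0):
    """(F_obs, F_calc) pairs with amplitude-independent noise, so weak
    reflections have the larger *relative* errors."""
    out = []
    for _ in range(n):
        f = random.uniform(lo, hi)
        out.append((f, f + random.gauss(0.0, sigma)))
    return out

def r_factor(subset):
    return (sum(abs(fo - fc) for fo, fc in subset)
            / sum(fo for fo, fc in subset))

refl = sorted(make_reflections(5000))  # sorted by F_obs
r_weak = r_factor(refl[:250])      # "test set" = weakest 5%
r_strong = r_factor(refl[-250:])   # "test set" = strongest 5%
print(r_weak, r_strong)  # r_strong is several-fold lower, same model
```

Nothing about the model changed between the two numbers; only the composition of the test set did.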

Selecting a test set that minimizes Rfree is so wrong on so many levels.
Unless, of course, the only thing I know about Rfree is that it is the
magic number that I need to make small by all means necessary.

Cheers,

Ed.


-- 
Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-15 Thread Nicholas M Glykos
> > For structures with a small number of reflections, the statistical 
> > noise in the 5% sets can be very significant indeed. We have seen 
> > differences between Rfree values obtained from different sets reaching 
> > up to 4%.
> 
> This is very intriguing indeed! Is there something specific in these 
> structures that Rfree differences depending on the set used reach 4%? 
> NCS? Or the 5% set having less than ~1000-1500 reflections?

Tassos, by your standards, these structures should have been described as 
'tiny' and not small ... ;-)   [Yes, significantly less than 1000. In one 
case the _total_ number of reflections was 5132 reflections (which were, 
nevertheless, slowly and meticulously measured by a CAD4 one-by-one. These 
were the days ... :-)) ].




-- 


  Dr Nicholas M. Glykos, Department of Molecular Biology
 and Genetics, Democritus University of Thrace, University Campus,
  Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-15 Thread Anastassis Perrakis
> 
> 
> For structures with a small number of reflections, the statistical noise 
> in the 5% sets can be very significant indeed. We have seen differences 
> between Rfree values obtained from different sets reaching up to 4%. 

This is very intriguing indeed!
Is there something specific in these structures that Rfree differences depending
on the set used reach 4%? NCS? Or the 5% set having less than ~1000-1500 
reflections?

It would be indeed very interesting if there was a correlation there!

A.

> 
> Ideally, instead of PDBSET+REFMAC we should have been using simulated 
> annealing (without positional refinement), but moving back and forth between 
> CNS/X-PLOR and CCP4 was too much for my laziness.
> 
> All the best,
> Nicholas
> 
> 
> -- 
> 
> 
>  Dr Nicholas M. Glykos, Department of Molecular Biology
> and Genetics, Democritus University of Thrace, University Campus,
>  Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
>Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-15 Thread Nicholas M Glykos
Dear Ethan, List,

> Surely someone must have done this!  But I can't recall ever reading
> an analysis of such a refinement protocol.  
> Does anyone know of relevant reports in the literature?

Total statistical cross-validation is indeed what we should be doing, but 
for large structures the computational cost may be significant. In its 
absence, the reported Rfree may be an 'outlier' (with respect to the 
distribution of the Rfree values that would have been obtained from all 
disjoint sets). To tackle this, we usually resort to the following ad hoc 
procedure:

 At an early stage of the positional refinement, we use a shell script 
which (a) uses Phil's PDBSET with the NOISE keyword to randomly shift 
atomic positions, (b) refines the resulting models to completion against 
each of the different free sets, (c) calculates the mean of the resulting 
free R values, and (d) selects (once and for all) the free set whose Rfree 
is closest to that mean.
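Step (d) reduces to a one-liner once the per-set free R values from steps (a)-(c) are in hand (the function name and the numbers below are hypothetical):

```python
def pick_representative_set(rfree_by_set):
    """Choose the free set whose Rfree lies closest to the mean over
    all candidate sets, so the reported Rfree is not an outlier."""
    mean = sum(rfree_by_set.values()) / len(rfree_by_set)
    return min(rfree_by_set, key=lambda k: abs(rfree_by_set[k] - mean))

# Hypothetical Rfree values for six candidate 5% sets:
rfrees = {0: 0.231, 1: 0.254, 2: 0.246, 3: 0.238, 4: 0.262, 5: 0.243}
print(pick_representative_set(rfrees))  # -> 2 (0.246 is closest to the mean)
```

Choosing the set nearest the mean, rather than the one giving the lowest Rfree, is what keeps the selection from biasing the reported statistic.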

For structures with a small number of reflections, the statistical noise 
in the 5% sets can be very significant indeed. We have seen differences 
between Rfree values obtained from different sets reaching up to 4%. 

Ideally, instead of PDBSET+REFMAC we should have been using simulated 
annealing (without positional refinement), but moving back and forth between 
CNS/X-PLOR and CCP4 was too much for my laziness.

All the best,
Nicholas


-- 


  Dr Nicholas M. Glykos, Department of Molecular Biology
 and Genetics, Democritus University of Thrace, University Campus,
  Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-14 Thread Pavel Afonine
Hi,

yes, shifts depend on resolution indeed. See pages 75-77 here:

http://www.phenix-online.org/presentations/latest/pavel_refinement_general.pdf

Pavel

On Fri, Oct 14, 2011 at 7:34 PM, Ed Pozharski wrote:

> On Fri, 2011-10-14 at 23:41 +0100, Phil Evans wrote:
> > I just tried refining a "finished" structure turning off the FreeR
> > set, in Refmac, and I have to say I can barely see any difference
> > between the two sets of coordinates.
>
> The amplitude of the shift, I presume, depends on the resolution and
> data quality.  With a very good 1.2A dataset refined with anisotropic
> B-factors to R~14% what I see is ~0.005A rms shift.  Which is not much,
> however the reported ML DPI is ~0.02A, so perhaps the effect is not that
> small compared to the precision of the model.
>
> On the other hand, the more "normal" example at 1.7A (and very good data
> refining down to R~15%) shows ~0.03A general variation with a variable
> test set.  Again, not much, but the ML DPI in this case is ~0.06A -
> comparable to the variation induced by the choice of the test set.
>
> Cheers,
>
> Ed.
>
> --
> Hurry up, before we all come back to our senses!
>  Julian, King of Lemurs
>


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-14 Thread Ed Pozharski
On Fri, 2011-10-14 at 23:41 +0100, Phil Evans wrote:
> I just tried refining a "finished" structure turning off the FreeR
> set, in Refmac, and I have to say I can barely see any difference
> between the two sets of coordinates.

The amplitude of the shift, I presume, depends on the resolution and
data quality.  With a very good 1.2A dataset refined with anisotropic
B-factors to R~14% what I see is ~0.005A rms shift.  Which is not much,
however the reported ML DPI is ~0.02A, so perhaps the effect is not that
small compared to the precision of the model.  

On the other hand, the more "normal" example at 1.7A (and very good data
refining down to R~15%) shows ~0.03A general variation with a variable
test set.  Again, not much, but the ML DPI in this case is ~0.06A -
comparable to the variation induced by the choice of the test set.

Cheers,

Ed.

-- 
Hurry up, before we all come back to our senses!
  Julian, King of Lemurs


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-14 Thread James Stroud
Each R-free flag corresponds to a particular HKL index. Redundancy refers to 
the number of times a reflection with a given HKL index is observed. The 
final structure factor for a given HKL can be thought of as an average of 
these redundant observations.
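That merging step can be sketched as follows (unweighted averaging for simplicity; real merging programs weight by sigma). Because the flag is attached to the HKL index, every redundant observation of that index inherits the same work/test assignment:

```python
from collections import defaultdict

def merge_observations(observations):
    """Average redundant observations of each unique HKL index.
    `observations`: iterable of ((h, k, l), value) tuples."""
    groups = defaultdict(list)
    for hkl, value in observations:
        groups[hkl].append(value)
    return {hkl: sum(v) / len(v) for hkl, v in groups.items()}

obs = [((1, 0, 0), 102.0), ((1, 0, 0), 98.0), ((2, 1, 1), 55.0)]
merged = merge_observations(obs)
print(merged[(1, 0, 0)])  # -> 100.0
```

This is why high redundancy does not blur the work/test boundary: redundancy averages within an HKL, while the free flag partitions between HKLs.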

Related to your question, someone once mentioned that for each particular space 
group, there should be a preferred R-free assignment. As far as I know, nothing 
tangible ever came of that idea.

James



On Oct 14, 2011, at 5:34 PM, D Bonsor wrote:

> I may be missing something (if so, I am curious why I am wrong), but with a 
> highly redundant dataset, wouldn't the difference from refining the final 
> model against the full dataset be small, given the random selection of 
> reflections for Rfree?



Re: [ccp4bb] should the final model be refined against full dataset

2011-10-14 Thread Edward A. Berry

Now it would be interesting to refine this structure to convergence,
with the original free set. If I understood correctly, Ian Tickle has
done essentially this, and the free R returns essentially to its
original value: the minimum arrived at is independent of the starting
point, perhaps within the limitation that one might get caught in a
different false minimum (which is unlikely given the minuscule changes
you see). If that is the case, we should stop worrying about
"corrupting" the free set by refining against it, or even about using it
to make maps in which models will be adjusted.
This is a perennial discussion, but I have never seen a report that the
original free R is in fact _not_ recoverable by refining to
convergence.

Phil Evans wrote:

I just tried refining a "finished" structure turning off the FreeR set, in 
Refmac, and I have to say I can barely see any difference between the two sets of 
coordinates.

 From this n=1 trial, I can't see that it improves the model significantly, nor 
that it ruins the model irretrievably for future purposes.

I suspect we worry too much about these things

Phil Evans



Indeed, perhaps we worry too much about such things.



On 14 Oct 2011, at 21:35, Nat Echols wrote:


On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang wrote:
Sorry, I don't quite understand your reasoning for how the structure is 
rendered useless if one refined it with all data.

"Useless" was too strong a word (it's Friday, sorry).  I guess simulated 
annealing can address the model-bias issue, but I'm not totally convinced that this 
solves the problem.  And not every crystallographer will run SA every time he/she solves 
an isomorphous structure, so there's a real danger of misleading future users of the PDB 
file.  The reported R-free, of course, is still meaningless in the context of the 
deposited model.

Would your argument also apply to all the structures that were refined before 
R-free existed?

Technically, yes - but how many proteins are there whose only representatives 
in the PDB were refined this way?  I suspect very few; in most cases, a more 
recent model should be available.

-Nat




Re: [ccp4bb] should the final model be refined against full dataset

2011-10-14 Thread Thomas C. Terwilliger
Dear Gerard,

I'm very happy for the discussion to be on the CCP4 list (or on the IUCR
forums, or both).  I was only trying to not create too much traffic.

All the best,
Tom T

>> Dear Tom,
>>
>>  I am not sure that I feel happy with your invitation that views on
>> such
>> crucial matters as these deposition issues be communicated to you
>> off-list.
>> It would seem much healthier if these views were aired out within the BB.
>> Again!, some will say ... but the difference is that there is now a forum
>> for them, set up by the IUCr, that may eventually turn opinions into some
>> form of action.
>>
>>  I am sure that many subscribers to this BB, and not just you as a
>> member of some committees, would be interested to hear the full variety of
>> views on the desirable and the feasible in these areas, and to express
>> their
>> own for everyone to read and discuss.
>>
>>  Perhaps John Helliwell can elaborate on this and on the newly created
>> forum.
>>
>>
>>  With best wishes,
>>
>>   Gerard.
>>
>> --
>> On Fri, Oct 14, 2011 at 04:56:20PM -0600, Thomas C. Terwilliger wrote:
>>> For those who have strong opinions on what data should be deposited...
>>>
>>> The IUCR is just starting a serious discussion of this subject. Two
>>> committees, the "Data Deposition Working Group", led by John Helliwell,
>>> and the Commission on Biological Macromolecules (chaired by Xiao-Dong
>>> Su)
>>> are working on this.
>>>
>>> Two key issues are (1) feasibility and importance of deposition of raw
>>> images and (2) deposition of sufficient information to fully reproduce
>>> the
>>> crystallographic analysis.
>>>
>>> I am on both committees and would be happy to hear your ideas
>>> (off-list).
>>> I am sure the other members of the committees would welcome your
>>> thoughts
>>> as well.
>>>
>>> -Tom T
>>>
>>> Tom Terwilliger
>>> terwilli...@lanl.gov
>>>
>>>
>>> >> This is a follow up (or a digression) to James comparing test set to
>>> >> missing reflections.  I also heard this issue mentioned before but
>>> was
>>> >> always too lazy to actually pursue it.
>>> >>
>>> >> So.
>>> >>
>>> >> The role of the test set is to prevent overfitting.  Let's say I have
>>> >> the final model and I monitored the Rfree every step of the way and
>>> can
>>> >> conclude that there is no overfitting.  Should I do the final
>>> refinement
>>> >> against complete dataset?
>>> >>
>>> >> IMCO, I absolutely should.  The test set reflections contain
>>> >> information, and the "final" model is actually biased towards the
>>> >> working set.  Refining using all the data can only improve the
>>> accuracy
>>> >> of the model, if only slightly.
>>> >>
>>> >> The second question is practical.  Let's say I want to deposit the
>>> >> results of the refinement against the full dataset as my final model.
>>> >> Should I not report the Rfree and instead insert a remark explaining
>>> the
>>> >> situation?  If I report the Rfree prior to the test set removal, it
>>> is
>>> >> certain that every validation tool will report a mismatch.  It does
>>> not
>>> >> seem that the PDB has a mechanism to deal with this.
>>> >>
>>> >> Cheers,
>>> >>
>>> >> Ed.
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Oh, suddenly throwing a giraffe into a volcano to make water is
>>> crazy?
>>> >> Julian, King of
>>> Lemurs
>>> >>
>>
>> --
>>
>>  Gerard Bricogne g...@globalphasing.com
>>  Global Phasing Ltd.
>>  Sheraton House, Castle Park   Tel: +44-(0)1223-353033
>>  Cambridge CB3 0AX, UK         Fax: +44-(0)1223-366889
>>


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-14 Thread D Bonsor
I may be missing something (if so, I am curious why I am wrong), but with a 
highly redundant dataset, wouldn't the difference from refining the final 
model against the full dataset be small, given the random selection of 
reflections for Rfree?


Re: [ccp4bb] should the final model be refined against full dataset

2011-10-14 Thread Gerard Bricogne
Dear Tom,

 I am not sure that I feel happy with your invitation that views on such
crucial matters as these deposition issues be communicated to you off-list.
It would seem much healthier if these views were aired out within the BB. 
Again!, some will say ... but the difference is that there is now a forum
for them, set up by the IUCr, that may eventually turn opinions into some
form of action.

 I am sure that many subscribers to this BB, and not just you as a
member of some committees, would be interested to hear the full variety of
views on the desirable and the feasible in these areas, and to express their
own for everyone to read and discuss.

 Perhaps John Helliwell can elaborate on this and on the newly created
forum.


 With best wishes,
 
  Gerard.

--
On Fri, Oct 14, 2011 at 04:56:20PM -0600, Thomas C. Terwilliger wrote:
> For those who have strong opinions on what data should be deposited...
> 
> The IUCR is just starting a serious discussion of this subject. Two
> committees, the "Data Deposition Working Group", led by John Helliwell,
> and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su)
> are working on this.
> 
> Two key issues are (1) feasibility and importance of deposition of raw
> images and (2) deposition of sufficient information to fully reproduce the
> crystallographic analysis.
> 
> I am on both committees and would be happy to hear your ideas (off-list). 
> I am sure the other members of the committees would welcome your thoughts
> as well.
> 
> -Tom T
> 
> Tom Terwilliger
> terwilli...@lanl.gov
> 
> 
> >> This is a follow up (or a digression) to James comparing test set to
> >> missing reflections.  I also heard this issue mentioned before but was
> >> always too lazy to actually pursue it.
> >>
> >> So.
> >>
> >> The role of the test set is to prevent overfitting.  Let's say I have
> >> the final model and I monitored the Rfree every step of the way and can
> >> conclude that there is no overfitting.  Should I do the final refinement
> >> against complete dataset?
> >>
> >> IMCO, I absolutely should.  The test set reflections contain
> >> information, and the "final" model is actually biased towards the
> >> working set.  Refining using all the data can only improve the accuracy
> >> of the model, if only slightly.
> >>
> >> The second question is practical.  Let's say I want to deposit the
> >> results of the refinement against the full dataset as my final model.
> >> Should I not report the Rfree and instead insert a remark explaining the
> >> situation?  If I report the Rfree prior to the test set removal, it is
> >> certain that every validation tool will report a mismatch.  It does not
> >> seem that the PDB has a mechanism to deal with this.
> >>
> >> Cheers,
> >>
> >> Ed.
> >>
> >>
> >>
> >> --
> >> Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
> >> Julian, King of Lemurs
> >>

-- 

 ===
 * *
 * Gerard Bricogne g...@globalphasing.com  *
 * *
 * Global Phasing Ltd. *
 * Sheraton House, Castle Park Tel: +44-(0)1223-353033 *
 * Cambridge CB3 0AX, UK   Fax: +44-(0)1223-366889 *
 * *
 ===


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Thomas C. Terwilliger
For those who have strong opinions on what data should be deposited...

The IUCR is just starting a serious discussion of this subject. Two
committees, the "Data Deposition Working Group", led by John Helliwell,
and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su)
are working on this.

Two key issues are (1) feasibility and importance of deposition of raw
images and (2) deposition of sufficient information to fully reproduce the
crystallographic analysis.

I am on both committees and would be happy to hear your ideas (off-list). 
I am sure the other members of the committees would welcome your thoughts
as well.

-Tom T

Tom Terwilliger
terwilli...@lanl.gov


>> This is a follow up (or a digression) to James comparing test set to
>> missing reflections.  I also heard this issue mentioned before but was
>> always too lazy to actually pursue it.
>>
>> So.
>>
>> The role of the test set is to prevent overfitting.  Let's say I have
>> the final model and I monitored the Rfree every step of the way and can
>> conclude that there is no overfitting.  Should I do the final refinement
>> against complete dataset?
>>
>> IMCO, I absolutely should.  The test set reflections contain
>> information, and the "final" model is actually biased towards the
>> working set.  Refining using all the data can only improve the accuracy
>> of the model, if only slightly.
>>
>> The second question is practical.  Let's say I want to deposit the
>> results of the refinement against the full dataset as my final model.
>> Should I not report the Rfree and instead insert a remark explaining the
>> situation?  If I report the Rfree prior to the test set removal, it is
>> certain that every validation tool will report a mismatch.  It does not
>> seem that the PDB has a mechanism to deal with this.
>>
>> Cheers,
>>
>> Ed.
>>
>>
>>
>> --
>> Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
>> Julian, King of Lemurs
>>


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Phil Evans
I just tried refining a "finished" structure turning off the FreeR set, in 
Refmac, and I have to say I can barely see any difference between the two sets 
of coordinates.

From this n=1 trial, I can't see that it improves the model significantly, nor 
that it ruins the model irretrievably for future purposes.   

I suspect we worry too much about these things.

Phil Evans

On 14 Oct 2011, at 21:35, Nat Echols wrote:

> On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang  wrote:
> Sorry, I don't quite understand your reasoning for how the structure is 
> rendered useless if one refined it with all data.
> 
> "Useless" was too strong a word (it's Friday, sorry).  I guess simulated 
> annealing can address the model-bias issue, but I'm not totally convinced 
> that this solves the problem.  And not every crystallographer will run SA 
> every time he/she solves an isomorphous structure, so there's a real danger 
> of misleading future users of the PDB file.  The reported R-free, of course, 
> is still meaningless in the context of the deposited model.
> 
> Would your argument also apply to all the structures that were refined before 
> R-free existed?
> 
> Technically, yes - but how many proteins are there whose only representatives 
> in the PDB were refined this way?  I suspect very few; in most cases, a more 
> recent model should be available.
> 
> -Nat


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Ethan Merritt
On Friday, October 14, 2011 02:45:08 pm Ed Pozharski wrote:
> On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote:
> > 
> > The benefit of including those extra 5% of data is always minimal 
> 
> And so is probably the benefit of excluding when all the steps that
> require cross-validation have already been performed.  My thinking is
> that excluding data from analysis should always be justified (and in the
> initial stages of refinement, it might be as it prevents overfitting),
> not the other way around.

A model with error bars is more useful than a marginally more
accurate model without error bars, not least because you are probably
taking it on faith that the second model is "more accurate".

Crystallographers were kind of late in realizing that a cross validation
test could be useful in assessing refinement.  What's more, we 
never really learned the whole lesson.  Rather than using the full
test, we use only one blade of the jackknife.  

http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation

The full test would involve running multiple parallel refinements, 
each one omiting a different disjoint set of reflections.  
The ccp4 suite is set up to do this,  since Rfree flags by default run 
from 0-19 and refmac lets you specify which 5% subset is to be omitted
from the current run. Of course, evaluating the end point becomes more
complex than looking at a single number "Rfree".

Surely someone must have done this!  But I can't recall ever reading
an analysis of such a refinement protocol.  
Does anyone know of relevant reports in the literature?

Is there a program or script that will collect K-fold parallel output
models and their residuals to generate a net indicator of model quality?

Ethan

-- 
Ethan A Merritt
Biomolecular Structure Center,  K-428 Health Sciences Bldg
University of Washington, Seattle 98195-7742
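Ethan's K-fold aggregation could be sketched as follows; a minimal Python illustration, assuming the per-fold Rfree values have already been harvested from the K parallel refinements (the numbers below are invented, not from a real refinement):

```python
from statistics import mean, stdev

# Hypothetical per-fold results: the refinement run K=20 times, each run
# excluding the subset carrying FreeR flag k.  These values are invented
# purely to illustrate the aggregation step.
fold_rfree = [0.231, 0.228, 0.235, 0.229, 0.233, 0.230, 0.227, 0.236,
              0.232, 0.229, 0.234, 0.231, 0.228, 0.233, 0.230, 0.235,
              0.229, 0.232, 0.231, 0.234]

def kfold_summary(rfree_values):
    """Collapse K per-fold Rfree values into one cross-validated
    indicator: the mean, with the standard deviation as an error bar."""
    return mean(rfree_values), stdev(rfree_values)

cv_rfree, cv_sd = kfold_summary(fold_rfree)
print(f"K-fold Rfree = {cv_rfree:.3f} +/- {cv_sd:.3f}")
```

This only addresses the "collect the residuals" half of the question; comparing the K output models themselves (e.g. pairwise rmsd of the ensemble) would be the other half.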


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Quyen Hoang

Thanks for the clear explanation. I understood that.
But I was trying to understand how this would negatively affect the  
initial model to render it useless or less useful.
In the scenario that you presented, I would expect a better result  
(better model) if the initial model was refined with all data, thus  
more useful.
Sure, again in your scenario, the "new" structure has seen R-free  
reflections in the equivalent indexes of its replacement model, but  
their intensities should be different anyway, so I am not sure how  
this is bad. Even if the bias is huge, let's say this bias results in  
1% reduction in initial R-free (exaggerating here), how would this  
make one's model bad or how would this be bad for one's science?
In the end, our objective is to build the best model possible and I  
think that more data would likely result in better model, not the  
other way around. If we can agree that refining a model with all data  
would result in a better model, then wouldn't not doing so constitute  
a compromise of model quality for a more "pure" statistic?


I had not refined a model with all data before (just to keep in line  
with convention), but I wondered if I was doing the best thing.


Cheers,
Quyen
On Oct 14, 2011, at 5:27 PM, Phil Jeffrey wrote:

Let's say you have two isomorphous crystals of two different protein- 
ligand complexes.  Same protein different ligand, same xtal form.   
Conventionally you'd keep the same free set reflections (hkl values)  
between the two datasets to reduce biasing.  However if the first  
model had been refined against all reflections there is no longer a  
free set for that model, thus all hkl's have seen the atoms during  
refinement, and so your R-free in the second complex is initially  
biased to the model from the first complex. [*]


The tendency is to do less refinement in these sort of isomorphous  
cases than in molecular replacement solutions, because the  
structural changes are usually far less (it is isomorphous after  
all) so there's a risk that the R-free will not be allowed to fully  
float free of that initial bias.  That makes your R-free look better  
than it actually is.


This is rather strongly analogous to using different free sets in  
the two datasets.


However I'm not sure that this is as big of a deal as it is being  
made to sound.  It can be dealt with straightforwardly.  However  
refining against all the data weakens the use of R-free as a  
validation tool for that particular model so the people that like to  
judge structures based on a single number (i.e. R-free) are going to  
be quite put out.


It's also the case that the best model probably *is* the one based  
on a careful last round of refinement against all data, as long as  
nothing much changes.  That would need to be quantified in some  
way(s).


Phil Jeffrey
Princeton

[* Your R-free is also initially model-biased in cases where the  
data are significantly non-isomorphous or you're using two different  
xtal forms, to varying extents]





I still don't understand how a structure model refined with all data
would negatively affect the determination and/or refinement of an
isomorphous structure using a different data set (even without  
doing SA

first).

Quyen

On Oct 14, 2011, at 4:35 PM, Nat Echols wrote:


On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang  wrote:

   Sorry, I don't quite understand your reasoning for how the
   structure is rendered useless if one refined it with all data.


"Useless" was too strong a word (it's Friday, sorry). I guess
simulated annealing can address the model-bias issue, but I'm not
totally convinced that this solves the problem. And not every
crystallographer will run SA every time he/she solves an isomorphous
structure, so there's a real danger of misleading future users of  
the
PDB file. The reported R-free, of course, is still meaningless in  
the

context of the deposited model.

   Would your argument also apply to all the structures that were
   refined before R-free existed?


Technically, yes - but how many proteins are there whose only
representatives in the PDB were refined this way? I suspect very  
few;

in most cases, a more recent model should be available.

-Nat






Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Craig A. Bingman
We have obligations that extend beyond simply presenting a "best" model.  

In an ideal world, the PDB would accept two coordinate sets and two sets of 
statistics, one for the last step where the cross-validation set was valid, and 
a final model refined against all the data.  Until there is a clear way to do 
that, and an unambiguous presentation of them to the public, IMO, the gains won 
by refinement against all the data are outweighed by the confusion that it can 
cause when presenting model and associated statistics to the public.


On Oct 14, 2011, at 3:32 PM, Jan Dohnalek wrote:

> Regarding refinement against all reflections: the main goal of our work is to 
> provide the best possible representation of the experimental data in the form 
> of the structure model. Once the structure building and refinement process is 
> finished keeping the Rfree set separate does not make sense any more. Its 
> role finishes once the last set of changes have been done to the model and 
> verified ...
> 
> J. Dohnalek


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Ed Pozharski
On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote:

> You should enter the statistics for the model and data that you
> actually deposit, not statistics for some other model that you might
> have had at one point but which the PDB will never see.  

If you read my post carefully, you'll see that I never suggested
reporting statistics for one model and depositing the other

> Not only does refining against R-free make it impossible to verify and
> validate your structure, it also means that any time you or anyone
> else wants to solve an isomorphous structure by MR using your
> structure as a search model, or continue the refinement with
> higher-resolution data, you will be starting with a model that has
> been refined against all reflections.  So any future refinements done
> with that model against isomorphous data are pre-biased, making your
> model potentially useless.

Frankly, I think you are exaggerating the magnitude of model bias in the
situation that I described.  You assume that the refinement will become
severely unstable after tossing in the test reflections.  Depending on
the resolution etc., the rms shift of the model may vary, but even if it
is, say, half an angstrom, the model hardly becomes useless (and that is
a huge overestimate).  And at least in theory, including *all the data*
should make the model more, not less, accurate.

> The benefit of including those extra 5% of data is always minimal 

And so is probably the benefit of excluding when all the steps that
require cross-validation have already been performed.  My thinking is
that excluding data from analysis should always be justified (and in the
initial stages of refinement, it might be as it prevents overfitting),
not the other way around.

Cheers,

Ed.

-- 
"Hurry up before we all come back to our senses!"
   Julian, King of Lemurs


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Felix Frolow
Recently we (I mean WE - community) frequently refine structures around 1 
Angstrom resolution.
This is not what Rfree was invented for. It was invented to get by with 
3.0-2.8 Angstrom data,
in times when people did not possess facilities good enough to look at the 
electron density maps…
We finish (WE - I again mean - community) the refinement of our structures too 
early.

Dr Felix Frolow   
Professor of Structural Biology and Biotechnology
Department of Molecular Microbiology
and Biotechnology
Tel Aviv University 69978, Israel

Acta Crystallographica F, co-editor

e-mail: mbfro...@post.tau.ac.il
Tel:  ++972-3640-8723
Fax: ++972-3640-9407
Cellular: 0547 459 608

On Oct 14, 2011, at 22:35 , Nat Echols wrote:

> On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang  wrote:
> Sorry, I don't quite understand your reasoning for how the structure is 
> rendered useless if one refined it with all data.
> 
> "Useless" was too strong a word (it's Friday, sorry).  I guess simulated 
> annealing can address the model-bias issue, but I'm not totally convinced 
> that this solves the problem.  And not every crystallographer will run SA 
> every time he/she solves an isomorphous structure, so there's a real danger 
> of misleading future users of the PDB file.  The reported R-free, of course, 
> is still meaningless in the context of the deposited model.
> 
> Would your argument also apply to all the structures that were refined before 
> R-free existed?
> 
> Technically, yes - but how many proteins are there whose only representatives 
> in the PDB were refined this way?  I suspect very few; in most cases, a more 
> recent model should be available.
> 
> -Nat



Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Phil Jeffrey
Let's say you have two isomorphous crystals of two different 
protein-ligand complexes.  Same protein different ligand, same xtal 
form.  Conventionally you'd keep the same free set reflections (hkl 
values) between the two datasets to reduce biasing.  However if the 
first model had been refined against all reflections there is no longer 
a free set for that model, thus all hkl's have seen the atoms during 
refinement, and so your R-free in the second complex is initially biased 
to the model from the first complex. [*]


The tendency is to do less refinement in these sort of isomorphous cases 
than in molecular replacement solutions, because the structural changes 
are usually far less (it is isomorphous after all) so there's a risk 
that the R-free will not be allowed to fully float free of that initial 
bias.  That makes your R-free look better than it actually is.


This is rather strongly analogous to using different free sets in the 
two datasets.


However I'm not sure that this is as big of a deal as it is being made 
to sound.  It can be dealt with straightforwardly.  However refining 
against all the data weakens the use of R-free as a validation tool for 
that particular model so the people that like to judge structures based 
on a single number (i.e. R-free) are going to be quite put out.


It's also the case that the best model probably *is* the one based on a 
careful last round of refinement against all data, as long as nothing 
much changes.  That would need to be quantified in some way(s).


Phil Jeffrey
Princeton

[* Your R-free is also initially model-biased in cases where the data 
are significantly non-isomorphous or you're using two different xtal 
forms, to varying extents]





I still don't understand how a structure model refined with all data
would negatively affect the determination and/or refinement of an
isomorphous structure using a different data set (even without doing SA
first).

Quyen

On Oct 14, 2011, at 4:35 PM, Nat Echols wrote:


On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang  wrote:

Sorry, I don't quite understand your reasoning for how the
structure is rendered useless if one refined it with all data.


"Useless" was too strong a word (it's Friday, sorry). I guess
simulated annealing can address the model-bias issue, but I'm not
totally convinced that this solves the problem. And not every
crystallographer will run SA every time he/she solves an isomorphous
structure, so there's a real danger of misleading future users of the
PDB file. The reported R-free, of course, is still meaningless in the
context of the deposited model.

Would your argument also apply to all the structures that were
refined before R-free existed?


Technically, yes - but how many proteins are there whose only
representatives in the PDB were refined this way? I suspect very few;
in most cases, a more recent model should be available.

-Nat
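Transferring the free set between isomorphous datasets, as Phil describes, amounts to copying the FreeR flag by Miller index so the same hkl values stay "free" in both refinements. A minimal sketch under toy assumptions (reflections as plain dictionaries; real data would come from an MTZ file via a reflection-file library, and the flag values themselves are invented):

```python
import random

# Toy data: FreeR flags from the first (already-refined) dataset, keyed
# by Miller index (h, k, l), and the hkl list of the new dataset.
old_flags = {(1, 0, 0): 0, (1, 1, 0): 1, (2, 1, 3): 0, (0, 0, 2): 1}
new_hkls = [(1, 0, 0), (1, 1, 0), (2, 1, 3), (0, 0, 2), (3, 2, 1)]

def transfer_free_flags(new_hkls, old_flags, n_sets=20):
    """Copy the old FreeR flag where the hkl already exists; reflections
    new to this dataset get a fresh random flag so every hkl ends up
    assigned to one of the n_sets subsets."""
    rng = random.Random(42)  # fixed seed: reproducible assignment
    return {hkl: old_flags.get(hkl, rng.randrange(n_sets))
            for hkl in new_hkls}

flags = transfer_free_flags(new_hkls, old_flags)
assert flags[(1, 0, 0)] == 0 and flags[(1, 1, 0)] == 1  # flags preserved
```

The point of the exercise is only the `old_flags.get(hkl, ...)` lookup: reflections shared between the isomorphous datasets keep their original free/working assignment.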




Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Quyen Hoang
I still don't understand how a structure model refined with all data  
would negatively affect the determination and/or refinement of an  
isomorphous structure using a different data set (even without doing  
SA first).


Quyen

On Oct 14, 2011, at 4:35 PM, Nat Echols wrote:

On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang   
wrote:
Sorry, I don't quite understand your reasoning for how the structure  
is rendered useless if one refined it with all data.


"Useless" was too strong a word (it's Friday, sorry).  I guess  
simulated annealing can address the model-bias issue, but I'm not  
totally convinced that this solves the problem.  And not every  
crystallographer will run SA every time he/she solves an isomorphous  
structure, so there's a real danger of misleading future users of  
the PDB file.  The reported R-free, of course, is still meaningless  
in the context of the deposited model.


Would your argument also apply to all the structures that were  
refined before R-free existed?


Technically, yes - but how many proteins are there whose only  
representatives in the PDB were refined this way?  I suspect very  
few; in most cases, a more recent model should be available.


-Nat




Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Nat Echols
On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang  wrote:

> Sorry, I don't quite understand your reasoning for how the structure is
> rendered useless if one refined it with all data.
>

"Useless" was too strong a word (it's Friday, sorry).  I guess simulated
annealing can address the model-bias issue, but I'm not totally convinced
that this solves the problem.  And not every crystallographer will run SA
every time he/she solves an isomorphous structure, so there's a real danger
of misleading future users of the PDB file.  The reported R-free, of course,
is still meaningless in the context of the deposited model.

Would your argument also apply to all the structures that were refined
> before R-free existed?


Technically, yes - but how many proteins are there whose only
representatives in the PDB were refined this way?  I suspect very few; in
most cases, a more recent model should be available.

-Nat


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Jan Dohnalek
Regarding refinement against all reflections: the main goal of our work is
to provide the best possible representation of the experimental data in the
form of the structure model. Once the structure building and refinement
process is finished keeping the Rfree set separate does not make sense any
more. Its role finishes once the last set of changes have been done to the
model and verified ...

J. Dohnalek


On Fri, Oct 14, 2011 at 10:23 PM, Craig A. Bingman <
cbing...@biochem.wisc.edu> wrote:

> Recent experience indicates that the PDB is checking these statistics very
> closely for new depositions.  The checks made by the PDB are intended to
> prevent accidents and oversights made by honest people from creeping into
> the database.  "Getting away" with something seems to imply some intention
> to deceive, and that is much more difficult to detect.
>
> On Oct 14, 2011, at 3:09 PM, Robbie Joosten wrote:
>
> The deposited R-free sets in the PDB are quite frequently 'unfree' or the
> wrong set was deposited (checking this is one of the recommendations in the
> VTF report in Structure). So at the moment you would probably get away with
> depositing an unfree R-free set ;)
>
>
>


-- 
Jan Dohnalek, Ph.D
Institute of Macromolecular Chemistry
Academy of Sciences of the Czech Republic
Heyrovskeho nam. 2
16206 Praha 6
Czech Republic

Tel: +420 296 809 390
Fax: +420 296 809 410


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Craig A. Bingman
Recent experience indicates that the PDB is checking these statistics very 
closely for new depositions.  The checks made by the PDB are intended to 
prevent accidents and oversights made by honest people from creeping into the 
database.  "Getting away" with something seems to imply some intention to 
deceive, and that is much more difficult to detect.

On Oct 14, 2011, at 3:09 PM, Robbie Joosten wrote:

> The deposited R-free sets in the PDB are quite frequently 'unfree' or the 
> wrong set was deposited (checking this is one of the recommendations in the 
> VTF report in Structure). So at the moment you would probably get away with 
> depositing an unfree R-free set ;)
> 



Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Quyen Hoang
Sorry, I don't quite understand your reasoning for how the structure  
is rendered useless if one refined it with all data.
Would your argument also apply to all the structures that were refined  
before R-free existed?


Quyen



You should enter the statistics for the model and data that you  
actually deposit, not statistics for some other model that you might  
have had at one point but which the PDB will never see.  Not only  
does refining against R-free make it impossible to verify and  
validate your structure, it also means that any time you or anyone  
else wants to solve an isomorphous structure by MR using your  
structure as a search model, or continue the refinement with higher- 
resolution data, you will be starting with a model that has been  
refined against all reflections.  So any future refinements done  
with that model against isomorphous data are pre-biased, making your  
model potentially useless.


I'm amazed that anyone is still depositing structures refined  
against all data, but the PDB does still get a few.  The benefit of  
including those extra 5% of data is always minimal in every paper  
I've seen that reports such a procedure, and far outweighed by  
having a reliable and relatively unbiased validation statistic that  
is preserved in the final deposition.  (The situation may be  
different for very low resolution data, but those structures are a  
tiny fraction of the PDB.)


-Nat


Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Robbie Joosten

Hi Ed,
 

> This is a follow up (or a digression) to James comparing test set to
> missing reflections. I also heard this issue mentioned before but was
> always too lazy to actually pursue it.
> 
> So.
> 
> The role of the test set is to prevent overfitting. Let's say I have
> the final model and I monitored the Rfree every step of the way and can
> conclude that there is no overfitting. Should I do the final refinement
> against complete dataset?
> 
> IMCO, I absolutely should. The test set reflections contain
> information, and the "final" model is actually biased towards the
> working set. Refining using all the data can only improve the accuracy
> of the model, if only slightly.
Hmm, if your R-free set is small the added value will also be small. If it is 
relatively big, then your previously established optimal weights may no longer 
be optimal. A more elegant thing to do would be to refine the model with, say, 20 
different 5% R-free sets, deposit the ensemble and report the average R(-free) 
plus a standard deviation. AFAIK, this is what the R-free set numbers that 
CCP4's FREERFLAG generates are for. Of course, in that case you should do 
enough refinement (and perhaps rebuilding) to make sure each R-free set is 
free. 
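The FREERFLAG-style partition Robbie refers to can be sketched like this; a toy illustration assuming a simple uniform random assignment of flags 0-19 over a reflection list (real programs operate on the actual MTZ reflection data, and may use a different random scheme):

```python
import random

def assign_free_sets(n_refl, n_sets=20, seed=0):
    """FREERFLAG-style assignment: give each reflection one flag in
    0..n_sets-1, roughly uniformly, so each flag value marks a disjoint
    ~5% subset that can serve as the test set in one of 20 refinements."""
    rng = random.Random(seed)
    return [rng.randrange(n_sets) for _ in range(n_refl)]

flags = assign_free_sets(40000)
# Each of the 20 subsets holds roughly 5% of the data, and the subsets
# are disjoint by construction (each reflection carries exactly one flag).
for k in range(20):
    frac = flags.count(k) / len(flags)
    assert 0.03 < frac < 0.07
```

Refining once per flag value and averaging the 20 R-free values would then give the mean ± standard deviation Robbie suggests reporting.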

> The second question is practical. Let's say I want to deposit the
> results of the refinement against the full dataset as my final model.
> Should I not report the Rfree and instead insert a remark explaining the
> situation? If I report the Rfree prior to the test set removal, it is
> certain that every validation tool will report a mismatch. It does not
> seem that the PDB has a mechanism to deal with this.
The deposited R-free sets in the PDB are quite frequently 'unfree' or the wrong 
set was deposited (checking this is one of the recommendations in the VTF 
report in Structure). So at the moment you would probably get away with 
depositing an unfree R-free set ;)
 
Cheers,
Robbie
 
 
> 
> Cheers,
> 
> Ed.
> 
> 
> 
> -- 
> Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
> Julian, King of Lemurs
  

Re: [ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Nat Echols
On Fri, Oct 14, 2011 at 12:52 PM, Ed Pozharski wrote:

> The second question is practical.  Let's say I want to deposit the
> results of the refinement against the full dataset as my final model.
> Should I not report the Rfree and instead insert a remark explaining the
> situation?  If I report the Rfree prior to the test set removal, it is
> certain that every validation tool will report a mismatch.  It does not
> seem that the PDB has a mechanism to deal with this.
>

You should enter the statistics for the model and data that you actually
deposit, not statistics for some other model that you might have had at one
point but which the PDB will never see.  Not only does refining against
R-free make it impossible to verify and validate your structure, it also
means that any time you or anyone else wants to solve an isomorphous
structure by MR using your structure as a search model, or continue the
refinement with higher-resolution data, you will be starting with a model
that has been refined against all reflections.  So any future refinements
done with that model against isomorphous data are pre-biased, making your
model potentially useless.

I'm amazed that anyone is still depositing structures refined against all
data, but the PDB does still get a few.  The benefit of including those
extra 5% of data is always minimal in every paper I've seen that reports
such a procedure, and far outweighed by having a reliable and relatively
unbiased validation statistic that is preserved in the final deposition.
 (The situation may be different for very low resolution data, but those
structures are a tiny fraction of the PDB.)

-Nat


[ccp4bb] should the final model be refined against full datset

2011-10-14 Thread Ed Pozharski
This is a follow up (or a digression) to James comparing test set to
missing reflections.  I also heard this issue mentioned before but was
always too lazy to actually pursue it.

So.

The role of the test set is to prevent overfitting.  Let's say I have
the final model and I monitored the Rfree every step of the way and can
conclude that there is no overfitting.  Should I do the final refinement
against complete dataset?

IMCO, I absolutely should.  The test set reflections contain
information, and the "final" model is actually biased towards the
working set.  Refining using all the data can only improve the accuracy
of the model, if only slightly.

The second question is practical.  Let's say I want to deposit the
results of the refinement against the full dataset as my final model.
Should I not report the Rfree and instead insert a remark explaining the
situation?  If I report the Rfree prior to the test set removal, it is
certain that every validation tool will report a mismatch.  It does not
seem that the PDB has a mechanism to deal with this.

Cheers,

Ed.



-- 
Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
Julian, King of Lemurs