Re: [ccp4bb] How many is too many free reflections?

2015-06-16 Thread dusan turk
Dear Axel and Paul,

Thank you for reopening the Rfree and TEST set discussion. The concepts of Rfree 
and the TEST set play an important role in crystallography. When you introduced 
them back in 1992, Rfree was the first systematic method of structure validation. 
Its advantage is that it uses only data from the structure being determined, in 
the absence of any other data sources. Nowadays, more than two decades later, we 
have learned a lot about structures. Real-space approaches, from density fit, 
through deviations from ideal, statistically derived geometrical restraints, to 
packing information, together provide insight into structure correctness and, in 
my opinion, guard against over-interpretation. Not to mention the fit of the 
model structure factors (Fmodel) to the measured data (Fobs), as expressed by 
the R-factor (Rwork).

While the concept of the TEST set and its use in refinement provides a simple 
criterion for structure validation, it raises the following concerns:

- Refining structures against incomplete data results in structures that are 
off the true target. Namely, omitting reflections from the WORK set introduces 
a bias of absence, a direct consequence of the orthogonality of the Fourier 
series terms. This bias is diminished by reducing the amount of data placed in 
the TEST set, but it never disappears entirely. Over time, the relative and 
absolute size of the TEST set has indeed been reduced substantially.
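
The bias of absence can be illustrated with a toy one-dimensional Fourier synthesis (an illustrative sketch, not from the paper): because the terms of the series are orthogonal, the information carried by an omitted term cannot be compensated by the remaining ones, so the synthesis deviates from the true density everywhere.

```python
import math
import random

random.seed(0)

# Toy 1-D "density": a Fourier series with N cosine terms and random coefficients.
N = 200
coeffs = [random.gauss(0.0, 1.0) for _ in range(N)]

def synth(x, used):
    # Sum only the terms whose indices are in `used`.
    return sum(coeffs[n] * math.cos(2 * math.pi * n * x) for n in used)

xs = [i / 512 for i in range(512)]
full = set(range(N))
# Omit a random 5% "test set" of terms, as when reflections are set aside.
omitted = set(random.sample(range(N), N // 20))
work = full - omitted

# RMS deviation of the truncated synthesis from the full one.
rms = math.sqrt(sum((synth(x, full) - synth(x, work)) ** 2 for x in xs) / len(xs))
print(f"RMS change in density from omitting 5% of terms: {rms:.3f}")
```

The deviation is nonzero for any nonempty set of omitted terms; shrinking the test set only shrinks it, never removes it.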

- The identification of TEST reflections faces the problem of their 
independence when identical subunits in a structure are related by NCS. I 
think that a substantial proportion of structures contains NCS. An interesting 
angle on the NCS issue is provided by the work of Silva & Rossmann (1985), who 
discarded most of the data almost proportionally to the level of NCS redundancy 
(using 1/7th of the data for the WORK set and 6/7 for the TEST set in the case 
of 10-fold NCS).

- An additional, so far almost neglected concern is the cross-propagation of 
systematic errors within structures. These errors are a consequence of 
interactions between structural parts through the chemical bonding and 
non-bonding energy terms used in refinement. Ignoring errors of this origin 
results in coordinate error estimates that are too small, and these estimates 
are essential for the Maximum Likelihood (ML) function.

- The original use of the TEST set in refinement relied on the Least Squares 
target. Apart from the bias of absence, the TEST set does not affect the Least 
Squares target itself, whereas the standard ML function relies on these data 
and is therefore biased by them.

- Rfree is an indicator of structure correctness and is monitored during 
refinement to ensure that it decreases; however, a different choice of TEST 
set will result in a different phase error and a different gap between Rfree 
and Rwork. The relationship between the Rfree-Rwork gap and the phase error 
across different TEST sets, calculated on our 5 test cases with 4 different 
portions of 31 different TEST sets, is either statistically significant or 
insignificant; the two groups contain approximately equal numbers of members. 
When the relationship was statistically significant, the lower Rfree-Rwork gap 
quite often delivered the higher phase error. (This part of the analysis was 
not included in the paper; however, the negative correlation may be seen in the 
trend of the orange dots in several graphs of Figure 6.) Hence, there is no 
guarantee that the TEST set with the lowest gap between Rfree and Rwork will 
also deliver the structure with the lowest phase error, which is an underlying 
assumption of the use of Rfree for structure validation. This suggests that the 
gap between Rfree and Rwork can be easily manipulated and the manipulation not 
spotted. In the absence of a reference structure it is simply impossible to 
discover which choice of TEST set, and the corresponding gap between Rfree and 
Rwork, delivers the structure with the lowest phase error. (This argument in a 
way supports Gerard's point that the TEST set may not be exchanged when various 
structures of the same crystal form of a molecule are being determined using 
the Rfree methodology.) The "trick" of exchanging the TEST set is no surprise 
to the community, which uses it on those occasions when it suspects that a 
too-large gap between Rfree and Rwork may lead to problems with a stubborn 
referee.

To overcome these concerns we developed the Maximum Likelihood Free Kick 
(ML FK) target function. As the cases used in the paper indicate, the ML FK 
target function delivered more accurate structures and a narrower spread of 
solutions than today's standard Maximum Likelihood Cross-Validation (ML CV) 
function in all tested cases, including the case of the 2AHN structure built 
in the wrong direction.
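
Schematically (a toy sketch of the idea as we read it, not the actual implementation in the paper), a "free kick" perturbs the model coordinates at random many times; the spread of an ensemble of such kicked models then serves to estimate model error in the ML target, in place of data set aside for cross-validation:

```python
import random

random.seed(1)

def kick(coords, amplitude):
    """Return a copy of the model with each coordinate randomly displaced."""
    return [(x + random.uniform(-amplitude, amplitude),
             y + random.uniform(-amplitude, amplitude),
             z + random.uniform(-amplitude, amplitude)) for x, y, z in coords]

# Hypothetical toy model: a handful of atom positions (in Angstroms).
model = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)]

# Generate an ensemble of kicked models; in the real method, structure
# factors computed from such an ensemble provide the error estimate that
# the cross-validated (free-set) estimate would otherwise supply.
ensemble = [kick(model, amplitude=0.3) for _ in range(100)]

# The kicks average out to (near) zero, so the ensemble scatters around
# the current model rather than shifting it.
mean_dx = sum(m[0][0] - model[0][0] for m in ensemble) / len(ensemble)
print(f"mean kick of atom 1 along x: {mean_dx:.3f}")
```

For the actual formulation see Praznikar & Turk (2014), Acta Cryst. D70, 3124-3134, cited later in this thread.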

Our understanding is that the role of Rfree should be considered from a 
historical perspective. In our paper we wrote “Regarding the use of Rfree to 

Re: [ccp4bb] How many is too many free reflections?

2015-06-16 Thread James Phillips
What is wrong with using Rfree until the very late stages of refinement,
then alternating refinements with all reflections and with Rfree reflections,
while not introducing more refinement parameters?

This way you would get a structure and e-map based on all the data while
ensuring that the data has not been overfitted.

Just a thought.

On Tue, Jun 16, 2015 at 8:07 AM, dusan turk dusan.t...@ijs.si wrote:

 snip

Re: [ccp4bb] How many is too many free reflections?

2015-06-10 Thread Axel Brunger
Dear Dusan,

Following up on Gerard's comment, we also read your nice paper with great 
interest. Your method appears most useful for cases with a limited number of 
reflections (e.g., small unit cell and/or low resolution), resulting in 5% test 
sets with fewer than 1000 reflections in total. It improves the performance of 
your implementation of ML refinement for the cases that you described. However, 
we don't think that you can conclude that cross-validation is not needed 
anymore. To quote your paper, in the Discussion section: 

To address the use of Rfree as an indicator of wrong structures, we repeated the 
Kleywegt and Jones experiment (Kleywegt & Jones, 1995; Kleywegt & Jones, 1997) 
and built the 2ahn structure in the reverse direction and then refined it in the 
absence of solvent using the ML CV and ML FK approaches. Fig. 9 shows that 
Rfree stayed around 50% and Rfree–Rwork around 15% in the case of the reverse 
structure regardless of the ML approach and the fraction of data used in the 
test set. These values indicate that there is a fundamental problem with the 
structure, which supports the further use of Rfree as an indicator.

Thank you for reaffirming the utility of the statistical tool of 
cross-validation. The reverse chain trace of 2ahn is admittedly an extreme case 
of misfitting, and would probably be detected with other validation tools as 
well these days. However, the danger of overfitting or misfitting is still a 
very real possibility for large structures, especially when only moderate to 
low resolution data are available, even with today's tools.

Cross-validation can help even at very low resolution: in Structure 20, 957-966 
(2012) we showed that cross-validation is useful for certain low resolution 
refinements where additional restraints (DEN restraints in that case) are used 
to reduce overfitting and obtain a more accurate structure. Cross-validation 
made it possible to detect overfitting of the data when no DEN restraints were 
used. We believe this should also apply when other types of restraints are used 
(e.g., reference model restraints in phenix.refine, REFMAC, or BUSTER).  

In summary, we believe that cross-validation remains an important (and 
conceptually simple) method to detect overfitting and for overall structure 
validation.

Axel

Axel T. Brunger
Professor and Chair, Department of Molecular and Cellular Physiology
Investigator, HHMI
Email: brun...@stanford.edu
Phone: 650-736-1031
Web: http://atbweb.stanford.edu/

Paul

Paul Adams
Deputy Division Director, Physical Biosciences Division, Lawrence Berkeley Lab
Division Deputy for Biosciences, Advanced Light Source, Lawrence Berkeley Lab
Adjunct Professor, Department of Bioengineering, U.C. Berkeley
Vice President for Technology, the Joint BioEnergy Institute
Laboratory Research Manager, ENIGMA Science Focus Area

Tel: 1-510-486-4225, Fax: 1-510-486-5909

http://cci.lbl.gov/paul
 On Jun 5, 2015, at 2:18 AM, Gerard Bricogne g...@globalphasing.com wrote:
 
 Dear Dusan,
 
 This is a nice paper and an interestingly different approach to
 avoiding bias and/or quantifying errors - and indeed there are all
 kinds of possibilities if you have a particular structure on which you
 are prepared to spend unlimited time and resources.
 
 The specific context in which Graeme's initial question led me to
 query instead who should set the FreeR flags, at what stage and on
 what basis? was that of the data analysis linked to high-throughput
 fragment screening, in which speed is of the essence at every step. 
 
 Creating FreeR flags afresh for each target-fragment complex
 dataset without any reference to those used in the refinement of the
 apo structure is by no means an irrecoverable error, but it will take
 extra computing time to let the refinement of the complex adjust to a
 new free set, starting from a model refined with the ignored one. It
 is in order to avoid the need for that extra time, or for a recourse
 to various debiasing methods, that the book-keeping faff described
 yesterday has been introduced. Operating without it is perfectly
 feasible, it is just likely to not be optimally direct.
 
 I will probably bow out here, before someone asks How many
 [e-mails from me] is too many? :-) .
 
 
 With best wishes,
 
  Gerard.
 
 --
 On Fri, Jun 05, 2015 at 09:14:18AM +0200, dusan turk wrote:
 Graeme,
 one more suggestion. You can avoid all the recipes by use all data for WORK 
 set and 0 reflections for TEST set regardless of the amount of data by using 
 the FREE KICK ML target. For explanation see our recent paper Praznikar, J. 
  Turk, D. (2014) Free kick instead of cross-validation in 
 maximum-likelihood refinement of macromolecular crystal structures. Acta 
 Cryst. D70, 3124-3134. 
 
 Link to the paper you can find at “http://www-bmb.ijs.si/doc/references.HTML”
 
 best,
 dusan
 
 
 
 On Jun 5, 2015, at 

Re: [ccp4bb] How many is too many free reflections?

2015-06-05 Thread Gerard Bricogne
Dear Frank,

 I was going to reply to Ian's last comment last night, but got
distracted.

 This last paragraph of Ian's message does sound rather negative
if detached from the context of the previous one, which was about
non-isomorphism between fragment complexes and the apo being the rule
rather than the exception. Ian uses the Crick-Magdoff definition of
an acceptable level of non-isomorphism, which is quite a stringent one
because it refers to a level that would invalidate isomorphism for
experimental phasing purposes. A much greater level of non-isomorphism
can be tolerated when it comes to solving a target-fragment complex
starting from the apo structure, so the Crick-Magdoff criterion is not
relevant here.

 Furthermore, I think that Ian perhaps too readily identifies the
effect of non-isomorphism in creating noise in the comparison of
intensities with its effect in invalidating the working vs. free status
of observations. I think, therefore, that Ian's claim that failing the
Crick-Magdoff criterion for isomorphism results in scrambling the
distinction between the working set and the free set is a very big
overstatement.

 You describe as "bookkeeping faff" the procedures that Ian and I
outlined to preserve the FreeR flags of the apo refinement, and ask
for a paper. These matters are probably not glamorous enough to find
their way into papers, and would best be discussed (or re-discussed)
in a specialised BB like this one. If the shift from the question "How
many is too many?" to "How should the free set be chosen?" that I tried
to bring about yesterday results in a general sharing of evidence that
otherwise gets set aside, I will be very happy. I would find it unwise
to dismiss this question by expecting that there would be a mountain
of published evidence if it were really important. 

 Let us go ahead, then: could everyone who has evidence (rather
than preconceptions) on this matter please come forward and share it?
Answering this question is very important, even if the conclusion is
that the faff is unimportant.


 With best wishes,
 
  Gerard.

--
On Thu, Jun 04, 2015 at 10:43:15PM +0100, Frank von Delft wrote:
 snip


Re: [ccp4bb] How many is too many free reflections?

2015-06-05 Thread Gerard Bricogne
Dear Dusan,

 This is a nice paper and an interestingly different approach to
avoiding bias and/or quantifying errors - and indeed there are all
kinds of possibilities if you have a particular structure on which you
are prepared to spend unlimited time and resources.

 The specific context in which Graeme's initial question led me to
query instead "who should set the FreeR flags, at what stage and on
what basis?" was that of the data analysis linked to high-throughput
fragment screening, in which speed is of the essence at every step. 

 Creating FreeR flags afresh for each target-fragment complex
dataset without any reference to those used in the refinement of the
apo structure is by no means an irrecoverable error, but it will take
extra computing time to let the refinement of the complex adjust to a
new free set, starting from a model refined with the ignored one. It
is in order to avoid the need for that extra time, or for a recourse
to various debiasing methods, that the book-keeping faff described
yesterday has been introduced. Operating without it is perfectly
feasible, it is just likely to not be optimally direct.

 I will probably bow out here, before someone asks "How many
[e-mails from me] is too many?" :-) .


 With best wishes,
 
  Gerard.

--
On Fri, Jun 05, 2015 at 09:14:18AM +0200, dusan turk wrote:
 snip

Re: [ccp4bb] How many is too many free reflections?

2015-06-05 Thread dusan turk
Graeme,
one more suggestion. You can avoid all the recipes by using all data for the 
WORK set and 0 reflections for the TEST set, regardless of the amount of data, 
with the FREE KICK ML target. For an explanation see our recent paper: 
Praznikar, J. & Turk, D. (2014) Free kick instead of cross-validation in 
maximum-likelihood refinement of macromolecular crystal structures. Acta 
Cryst. D70, 3124-3134. 

A link to the paper can be found at “http://www-bmb.ijs.si/doc/references.HTML”

best,
dusan

 

 On Jun 5, 2015, at 1:03 AM, CCP4BB automatic digest system 
 lists...@jiscmail.ac.uk wrote:
 
 Date:Thu, 4 Jun 2015 08:30:57 +
 From:Graeme Winter graeme.win...@gmail.com
 Subject: Re: How many is too many free reflections?
 
 Hi Folks,
 
 Many thanks for all of your comments - in keeping with the spirit of the BB
 I have digested the responses below. Interestingly I suspect that the
 responses to this question indicate the very wide range of resolution
 limits of the data people work with!
 
 Best wishes Graeme
 
 ===
 
 Proposal 1:
 
 10% reflections, max 2000
 
 Proposal 2: from wiki:
 
 http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set
 
 including Randy Read recipe:
 
 So here's the recipe I would use, for what it's worth:
  < 10000 reflections:     set aside 10%
  10000-20000 reflections: set aside 1000 reflections
  20000-40000 reflections: set aside 5%
  > 40000 reflections:     set aside 2000 reflections
 
 Proposal 3:
 
 5% maximum 2-5k
 
 Proposal 4:
 
 3% minimum 1000
 
 Proposal 5:
 
 5-10% of reflections, minimum 1000
 
 Proposal 6:
 
 At least 50 reflections per bin in order to get reliable ML parameter
 estimation, ideally around 150 / bin.
 
 Proposal 7:
 
 If there are lots of reflections (i.e. >800K unique), around 1% selected - 5% 
 would be 40k, i.e. rather a lot. Referees question the use of >5k reflections 
 as a test set.
 
 Comment 1 in response to this:
 
 Surely absolute # of test reflections is not relevant, percentage is.
 
 
 
 Approximate consensus (i.e. what I will look at doing in xia2) - probably
 follow Randy Read recipe from ccp4wiki as this seems to (probably) satisfy
 most of the criteria raised by everyone else.
 
 
 
 On Tue, Jun 2, 2015 at 11:26 AM Graeme Winter graeme.win...@gmail.com
 wrote:
 
 Hi Folks
 
 Had a vague comment handed my way that xia2 assigns too many free
 reflections - I have a feeling that by default it makes a free set of 5%
 which was OK back in the day (like I/sig(I) = 2 was OK) but maybe seems
 excessive now.
 
 This was particularly in the case of high resolution data where you have a
 lot of reflections, so 5% could be several thousand which would be more
 than you need to just check Rfree seems OK.
 
 Since I really don't know what is the right # reflections to assign to a
 free set thought I would ask here - what do you think? Essentially I need
 to assign a minimum %age or minimum # - the lower of the two presumably?
 
 Any comments welcome!
 
 Thanks & best wishes, Graeme
 
 

Dr. Dusan Turk, Prof.
Head of Structural Biology Group http://bio.ijs.si/sbl/ 
Head of Centre for Protein and Structure Production
Centre of excellence for Integrated Approaches in Chemistry and Biology of 
Proteins, Scientific Director
http://www.cipkebip.org/
Professor of Structural Biology at IPS Jozef Stefan
e-mail: dusan.t...@ijs.si
phone: +386 1 477 3857   Dept. of Biochem. Mol. Struct. Biol.
fax:   +386 1 477 3984   Jozef Stefan Institute
Jamova 39, 1000 Ljubljana, Slovenia
Skype: dusan.turk (voice over internet: www.skype.com)


Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Frank von Delft
I'm afraid Gerard and Ian between them have left me a bit confused with 
conflicting statements:



On 04/06/2015 15:29, Gerard Bricogne wrote:

snip
In order to guard the detection of putative bound fragments against the evils 
of model bias, it is very important to ensure that the refinement of each 
complex against data collected on it does not treat as free any reflections 
that were part of the working set in the refinement of the apo structure.
snip


On 04/06/2015 17:34, Ian Tickle wrote:

snip
So I suspect that most of our efforts in maintaining common free R 
flags are for nothing; however it saves arguments with referees when 
it comes to publication!

snip



I also remember conversations and even BB threads that made me conclude 
that it did NOT matter to have the same Rfree set for independent 
datasets (e.g. different crystals).  I confess I don't remember the 
arguments, only the relief at not having to bother with all the 
bookkeeping faff Gerard outlines and Ian describes.


So:  could someone explain in detail why this matters (or why not), and 
is there a URL to the evidence (paper or anything else) in either 
direction?


(As far as I remember, the argument went that identical free sets were 
unnecessary even for exactly isomorphous crystals.  Something like 
this:  model bias is not a big deal when the model has largely 
converged, and that's what you have for molecular substitution (as Jim 
Pflugrath calls it).  In addition, even a weakly binding fragment 
compound produces intensity perturbations large enough to make model 
bias irrelevant.)


phx


Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Pavel Afonine
  It seems to me that the how many is too many aspect of this
 question, and the various culinary procedures that have been proposed
 as answers, may have obscured another, much more fundamental issue,
 namely: is it really the business of the data processing package to
 assign FreeR flags?

  I would argue that it isn't. (...)



Excellent point! I can't agree more.

Pavel


Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Edward A. Berry

In other words, the free set for each complex must be
such that reflections that are also present in the apo dataset retain
the FreeR flag they had in that dataset.


A very easy way to achieve this: generate a complete dataset to ridiculously
high resolution with the cell of your crystal, and assign free-R flags.
(If the first structure has already been solved, merge its free set and
extend to the new reflections.)
Now for every new structure solved, discard any free set that the data
reduction program may have generated and merge with the complete set,
discarding reflections with no Fobs (MNF) or with SigF=0.

In fact, if we consider a dataset is just a 3-dimensional array, or some
subset of it enclosing the reciprocal space asymmetric unit, I don't
see any reason we couldn't assign one universal P1 free-R set and
use it for every structure in whatever space group. By taking each
new dataset, merging with the universal Free-R, and discarding those
reflections not present in the new data, you would obtain a random
set for your structure. There could be nested (concentric?) free-R sets
with 10%, 5%, 2%, 1% free so that if you start out excluding 5% for a
low-res structure then get a high resolution dataset and want to exclude 2%,
you could be sure that all the 2% free reflections were also free in
your previous 5% set.

Thin or thick shells could be predefined. There may be problems when
it is desired to exclude reflections according to some twin law or NCS.
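
One way to realize such a universal, nested free-R set (a sketch of the idea only; no existing program is implied) is to derive each reflection's status deterministically from its Miller indices, so that every dataset containing a given hkl assigns it the same status, and the 1% set is by construction a subset of the 2%, 5% and 10% sets:

```python
import hashlib

def free_fraction(hkl):
    """Map a Miller index to a reproducible pseudo-random number in [0, 1)."""
    h, k, l = hkl
    digest = hashlib.sha256(f"{h},{k},{l}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def is_free(hkl, fraction=0.05):
    """Flag hkl free if its hash value falls below `fraction`.
    Because every fraction thresholds the same value, the free sets are
    nested: the 1% set sits inside the 2%, 5% and 10% sets."""
    return free_fraction(hkl) < fraction

# Any reflection free at 2% is automatically free at 5%:
sample = [(h, k, l) for h in range(10) for k in range(10) for l in range(10)]
free2 = {r for r in sample if is_free(r, 0.02)}
free5 = {r for r in sample if is_free(r, 0.05)}
print(free2 <= free5)
```

In practice one would first map each hkl to a unique asymmetric-unit representative (and handle the twin-law and NCS complications mentioned above) before hashing, so that symmetry mates share a flag.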

(just now read Nick Keep's post which expresses some similar ideas)
eab

On 06/04/2015 10:29 AM, Gerard Bricogne wrote:

Dear Graeme and other contributors to this thread,

  It seems to me that the "how many is too many" aspect of this
question, and the various culinary procedures that have been proposed
as answers, may have obscured another, much more fundamental issue,
namely: is it really the business of the data processing package to
assign FreeR flags?

  I would argue that it isn't. From the statistical viewpoint that
justifies the need for FreeR flags, these are pre-refinement entities
rather than post-processing ones. If one considers a single instance
of going from a dataset to a refined structure, then this distinction
may seem artificial. Consider, instead, the case of high-throughput
screening to detect fragment binding on a large number of crystals of
complexes between a given target protein (the apo) and a multitude
of small, weakly-binding fragments into solutions of which crystals of
the apo have been soaked.

  The model for the apo crystal structure comes from a refinement
against a dataset, using a certain set of FreeR flags. In order to
guard the detection of putative bound fragments against the evils of
model bias, it is very important to ensure that the refinement of each
complex against data collected on it does not treat as free any
reflections that were part of the working set in the refinement of the
apo structure. In other words, the free set for each complex must be
such that reflections that are also present in the apo dataset retain
the FreeR flag they had in that dataset. Any mixup, in the FreeR flags
for a complex, of the work vs. free status of the reflections also in
the apo would push Rwork up and Rfree down, invalidating their role as
indicators of quality of fit or of incipient overfitting.

  Great care must therefore be exercised, in the form of adequate
book-keeping and procedures for generating the FreeR flags in the mtz
file for each complex from that for the apo, to properly enforce this
inheritance of work vs. free status.

  In such a context there is a clear and crucial difference between
a post-processing entity and a pre-refinement one. FreeR flags belong
to the latter category. In fact, the creation of FreeR flags at the
end of the processing step can create a false perception, among people
doing ligand screening under pressure, that they cannot re-use the
FreeR flag information of the apo in refining their complexes, simply
because a new set has been created for each of them. This is clearly
to be avoided. Preserving the FreeR flags of the reflections that were
used in the refinement of the apo structure is one of the explicit
recommendations explicitly in the 2013 paper by Pozharski et al. (Acta
Cryst. D69, 150-167) - see section 1.1.3, p.152.

  Best practice in this area may therefore not be only a question
of numbers, but also of doing the appropriate thing in the appropriate
place. There are of course corner cases where e.g. substantial
unit-cell changes start to introduce some cross-talk between working
and free reflections, but the possibililty of such complications is no
argument to justify giving up on doing the right thing when the right
thing can be done.


  With best wishes,

   Gerard.

--
On Thu, Jun 04, 2015 at 08:30:57AM +, Graeme Winter wrote:

Hi Folks,

Many thanks for all of your comments - in keeping with the spirit of the BB
I have digested 

Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread James Holton
Many good points have been made on this thread so far, but mostly 
addressing the question "how many free reflections is enough", whereas 
the original question was "how many is too many".


I suppose a reasonable definition of "too many" is when the error 
introduced into the map by leaving out all those reflections starts to 
become a problem.  It is easy to calculate this error: it is simply the 
difference between the map made using all reflections (regardless of 
Free-R flag) and the map made with 5% of the reflections left out.  Of 
course, this difference map is identical to a map calculated using 
only the 5% free reflections, setting all others to zero.  The RMS 
variation of this error map is actually independent of the phases used 
(Parseval's theorem), and it ends up being:


RMSerror = RMSall * sqrt( free_frac )
where:
RMSerror is the RMS variation of the error map
RMSall is the RMS variation of the map calculated with all reflections
free_frac is the fraction of hkls left out of the calculation.
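A quick numerical check of this relation (a sketch with random complex coefficients standing in for structure factors; by Parseval's theorem the map-space RMS ratio equals the coefficient-space one, so no actual FFT is needed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                      # number of "reflections" (made up)

# Random complex coefficients stand in for the structure factors Fhkl.
F = rng.normal(size=n) + 1j * rng.normal(size=n)

# Flag 5% of them free, as in the example above.
free = rng.random(n) < 0.05

# RMS of the full map vs. RMS of the error map built from only the
# free reflections (all others set to zero), compared via Parseval.
rms_all = np.sqrt(np.mean(np.abs(F) ** 2))
rms_err = np.sqrt(np.mean(np.abs(np.where(free, F, 0.0)) ** 2))

print(rms_err / rms_all)         # close to sqrt(0.05) ~ 0.223
```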

So, with 5% free reflections, the errors induced in the electron density 
will have an RMS variation that is 22.3% of the full map's RMS 
variation, or 0.223 sigma units.  1% free reflections will result in an 
RMS 10% error, or 0.1 sigmas.  This means, for example, that with 5% 
free reflections a 1.0 sigma peak might come up as a 1.2 or 0.8 
sigma feature.  Note that these are not the sigmas of the Fo-Fc map 
(which changes as you build) but rather the sigma of the Fo map.  Most 
of us don't look at Fo maps, but rather 2Fo-Fc or 2mFo-DFc maps, with or 
without the missing reflections filled in.  These are a bit different 
from a straight Fo map.  The absolute electron number density (e-/A^3) 
of the 1 sigma contour for all these maps is about the same, but no 
doubt the fill-in, the extra Fo-Fc term, and the likelihood weights 
reduce the overall RMS error.  By how much?  That is a good question.


Still, we can take this RMS 0.223 sigma variation from 5% free 
reflections as a worst-case scenario, and then ask the question: is this 
a problem?  Well, any source of error can be a problem, but when you 
are trying to find the best compromise between two 
difficult-to-reconcile considerations (such as the stability of Rfree 
and the interpretability of the map), it is usually helpful to bring in 
a third consideration: such as how much noise is in the map already due 
to other sources?  My colleagues and I measured this recently (doi: 
10.1073/pnas.1302823110), and found that the 1-sigma contour ranges from 
0.8 to 1.2 e-/A^3 (relative to vacuum), experimental measurement errors 
are RMS ~0.04 e-/A^3, and map errors from the model-data difference are 
about RMS 0.13 e-/A^3.  So, 22.3% of sigma is around RMS 0.22 e-/A^3.  
This is a bit larger than our biggest empirically-measured error: the 
modelling error, indicating that 5% free flags may indeed be too much.


However, 22.3% is the worst-case error, in the absence of all the 
corrections used to make 2mFo-DFc maps, so in reality the modelling 
error and the omitted-reflection errors are probably comparable, 
indicating that 5% is about the right amount.  Any more and the error 
from omitted reflections starts to dominate the total error.   On the 
other hand, the modelling error is (by definition) the Fo-Fc difference, 
so as Rwork/Rfree get smaller the RMS map variation due to modelling 
errors gets smaller as well, eventually exposing the omitted-reflection 
error.  So, once your Rwork/Rfree get to be less than ~22%, the errors 
in the map are starting to be dominated by the missing Fs of the 5% free 
set.


However, early in the refinement, when your R factors are in the 30%s, 
40%s, or even 50%s, I don't think the errors due to missing 5% of the 
reflections are going to be important.  Then again, late in refinement, 
it might be a good idea to start including some or all of the free 
reflections back into the working set in order to reduce the overall map 
error (cue lamentations from validation experts such as Jane Richardson).


This is perhaps the most important topic on this thread.  There are so 
many ways to contaminate, bias or otherwise compromise the free set, 
and once done we don't have generally accepted procedures for 
re-sanctifying the free reflections, other than starting over again from 
scratch.  This is especially problematic if your starting structure for 
molecular replacement was refined against all reflections, and your 
ligand soak is nice and isomorphous to those original crystals.  How do 
you remove the evil bias from this model?  You can try shaking it, but 
that only really removes bias at high spatial frequencies and is not so 
effective at low resolution.
So, if bias is so easy to generate, why not use it to our advantage?  
Instead of leaving the free-flagged reflections out of the refinement, 
put them in, but give them random F values.  Then do everything you can 
to bias your model toward these random values.  Loosen the 

Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Ian Tickle
Nick

What you describe is (almost) exactly the way we have always done it at
Astex, and I'm surprised to hear that others are not routinely doing the
same.  The difference is that we don't generate a free R flag MTZ file to
ultra-high resolution as you suggest, since there's never any need to.
What we do is generate by default a 1.5 Ang. free R flag file using UNIQUE,
FREERFLAG and MTZUTILS whenever a new apo structure for a given
target/crystal form is solved and keep that with the initial apo data as a
reference dataset for auto-re-indexing (so that all the protein-ligand
datasets are indexed the same way).  When a dataset is combined with the
higher resolution free R flag file we would of course cut the resolution to
that of the data (still keeping the original free R flag file), mainly in
order to save space in the database.

Obviously if the initial apo data were higher resolution than 1.5 Ang, the
processing script would generate an initial free R flag file to a
correspondingly higher resolution (say to 1 Ang.).  If a ligand dataset comes
later at higher resolution than 1.5 Ang. the script would do the same
thing, but then it would use the MTZUTILS UNIQ option to merge the old free
R flags up to 1.5 Ang. with the new ones between 1.5 and 1 Ang.  Then it
would combine the data file with the free R flag file as before and cut the
resolution of the combined data file to the actual resolution of the data.
The script would then replace the old free R flag file with the new one and
use the latter for all subsequent datasets from that target/crystal form.
The users are completely unaware that any of this is happening (unless they
want to dig into the scripts!).

We enforce use of 'approved' scripts for all the processing and refinement
essentially by using an Oracle database with web-based access
authentication which means that if you don't use the approved scripts to
process your data then you can't upload your data to the database, which
then means that no-one else will get to see and/or use your results!  Our
scripts make full use of CCP4 and Global Phasing programs (autoPROC,
autoBUSTER, GRADE etc): however using CCP4i or other programs from the
command line to process the data and only uploading the final results to
the database is severely deprecated (and totally unsupported!), mainly
because there will then be no permanent traceback in the database of the
user's actions for others to see.

On Gerard's final point about the effect of non-isomorphism, we find that
isomorphism is the exception rather than the rule, i.e. the majority of our
datasets would fail the Crick-Magdoff test for isomorphism (i.e. no more
than 0.5% change for all cell lengths for 3 Ang. data and a correspondingly
lower threshold at more typical resolution limits of 2 - 1.5 Ang.).  This
is obviously very target and crystal form-dependent, some targets/crystal
forms give more isomorphous crystals than others.  So I suspect that most
of our efforts in maintaining common free R flags are for nothing; however
it saves arguments with referees when it comes to publication!

Cheers

-- Ian



Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Gerard Bricogne
Dear Graeme and other contributors to this thread,

 It seems to me that the "how many is too many" aspect of this
question, and the various culinary procedures that have been proposed
as answers, may have obscured another, much more fundamental issue,
namely: is it really the business of the data processing package to
assign FreeR flags?

 I would argue that it isn't. From the statistical viewpoint that
justifies the need for FreeR flags, these are pre-refinement entities
rather than post-processing ones. If one considers a single instance
of going from a dataset to a refined structure, then this distinction
may seem artificial. Consider, instead, the case of high-throughput
screening to detect fragment binding on a large number of crystals of
complexes between a given target protein (the apo) and a multitude
of small, weakly-binding fragments into solutions of which crystals of
the apo have been soaked.

 The model for the apo crystal structure comes from a refinement
against a dataset, using a certain set of FreeR flags. In order to
guard the detection of putative bound fragments against the evils of
model bias, it is very important to ensure that the refinement of each
complex against data collected on it does not treat as free any
reflections that were part of the working set in the refinement of the
apo structure. In other words, the free set for each complex must be
such that reflections that are also present in the apo dataset retain
the FreeR flag they had in that dataset. Any mixup, in the FreeR flags
for a complex, of the work vs. free status of the reflections also in
the apo would push Rwork up and Rfree down, invalidating their role as
indicators of quality of fit or of incipient overfitting.

 Great care must therefore be exercised, in the form of adequate
book-keeping and procedures for generating the FreeR flags in the mtz
file for each complex from that for the apo, to properly enforce this 
inheritance of work vs. free status.
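As a hypothetical sketch of that book-keeping (plain dicts keyed by Miller index; the names and the 5% default are my assumptions, not any particular program's interface):

```python
import random

def inherit_free_flags(apo_free, complex_hkls, fraction=0.05, seed=42):
    """Return {hkl: is_free} for the complex dataset.

    Reflections also present in the apo keep the work/free status they
    had there; only genuinely new reflections get a fresh random flag.
    """
    rng = random.Random(seed)
    return {
        hkl: apo_free[hkl] if hkl in apo_free else (rng.random() < fraction)
        for hkl in complex_hkls
    }

# Toy example: (1,0,0) stays free, (0,1,0) stays working,
# and (2,0,0) -- absent from the apo -- is newly assigned.
apo = {(1, 0, 0): True, (0, 1, 0): False, (0, 0, 1): False}
flags = inherit_free_flags(apo, [(1, 0, 0), (0, 1, 0), (2, 0, 0)])
```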

 In such a context there is a clear and crucial difference between
a post-processing entity and a pre-refinement one. FreeR flags belong
to the latter category. In fact, the creation of FreeR flags at the
end of the processing step can create a false perception, among people
doing ligand screening under pressure, that they cannot re-use the
FreeR flag information of the apo in refining their complexes, simply
because a new set has been created for each of them. This is clearly
to be avoided. Preserving the FreeR flags of the reflections that were
used in the refinement of the apo structure is one of the explicit
recommendations in the 2013 paper by Pozharski et al. (Acta
Cryst. D69, 150-167) - see section 1.1.3, p.152.

 Best practice in this area may therefore not be only a question
of numbers, but also of doing the appropriate thing in the appropriate
place. There are of course corner cases where e.g. substantial
unit-cell changes start to introduce some cross-talk between working
and free reflections, but the possibility of such complications is no
argument to justify giving up on doing the right thing when the right
thing can be done.


 With best wishes,
  
  Gerard.


Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Nicholas Keep
I agree with Gerard.  It would be much better in many ways to generate a 
separate file of Free R flags for each crystal form of a project to some 
high resolution that is unlikely to ever be exceeded eg 0.4 A that is a 
separate input file to refinement rather than in the mtz.



The generation of this free set could ask some questions, like: is the 
data twinned? do you want to extend the free set from a higher-symmetry 
free set, e.g. C2 rather than C2221 (the symmetry is close to the higher 
symmetry but not perfect; this seems to happen not infrequently)?


Could some judicious selection of sets of potentially related 
hkls work as a universal free set? (Not thought this through fully.)


This would get around practical issues like the one I had yesterday in 
refining in another well-known package, where coot drew the map as if it 
were 0.5 A data even though there were only observed data to 2.1 A, the 
rest just being a hopelessly overoptimistic guess of the best ever 
dataset we might collect.


I agree you CAN do this with current software- it is just not the path 
of least resistance, so you have to double check your group are doing this.


Best wishes
Nick





--
Prof Nicholas H. Keep
Executive Dean of School of Science
Professor of Biomolecular Science
Crystallography, Institute for Structural and Molecular Biology,
Department of Biological Sciences
Birkbeck,  University of London,
Malet Street,
Bloomsbury
LONDON
WC1E 7HX

email n.k...@mail.cryst.bbk.ac.uk
Telephone 020-7631-6852  (Room G54a Office)
  020-7631-6800  (Department Office)
Fax   020-7631-6803
If you want to access me in person you have to come to the crystallography 
entrance
and ring me or the department office from the internal phone by the door


Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Graeme Winter
Hi Folks,

Many thanks for all of your comments - in keeping with the spirit of the BB
I have digested the responses below. Interestingly I suspect that the
responses to this question indicate the very wide range of resolution
limits of the data people work with!

Best wishes Graeme

===

Proposal 1:

10% reflections, max 2000

Proposal 2: from wiki:

http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set

including Randy Read recipe:

So here's the recipe I would use, for what it's worth:
   < 10,000 reflections:        set aside 10%
   10,000-20,000 reflections:   set aside 1000 reflections
   20,000-40,000 reflections:   set aside 5%
   > 40,000 reflections:        set aside 2000 reflections

Proposal 3:

5% maximum 2-5k

Proposal 4:

3% minimum 1000

Proposal 5:

5-10% of reflections, minimum 1000

Proposal 6:

At least 50 reflections per bin in order to get reliable ML parameter
estimation, ideally around 150 / bin.

Proposal 7:

If lots of reflections (e.g. 800K unique), around 1% selected - 5% would be
40k, i.e. rather a lot. Referees question use of > 5k reflections as a test
set.

Comment 1 in response to this:

Surely absolute # of test reflections is not relevant, percentage is.



Approximate consensus (i.e. what I will look at doing in xia2) - probably
follow Randy Read recipe from ccp4wiki as this seems to (probably) satisfy
most of the criteria raised by everyone else.






Re: [ccp4bb] How many is too many free reflections?

2015-06-02 Thread Robbie Joosten
Hi Graeme,

 

We have had similar discussions with PDB_REDO, which is frequently forced to 
assign a new R-free set when the input data doesn't have one (this still 
happens with new PDB entries!). The '500/1000/1500/2000 reflections is enough' 
school seems to look only at the variance of R-free for different choices of 
test sets, which depends on the absolute number of reflections. You also want 
a representative sample of reciprocal space, which depends on the fraction of 
reflections. In PDB_REDO we make a new test set if:

-  The test set is smaller than 1% of the reflections

-  When the set has fewer than 500 reflections AND is smaller than 10% 
of the reflections.

 

The new set is chosen as at least 5% of the possible reflections given the cell 
parameters and the resolution. If there are between 10,000 and 20,000 
reflections, the percentage is increased to get at least 1000 reflections in 
the test set. So the maximum percentage is 10%. 
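Those criteria amount to a couple of simple predicates; a sketch (the function names are mine, not PDB_REDO's, and the re-selection fraction follows the description above):

```python
def needs_new_test_set(n_test, n_total):
    """A new R-free set is made if the existing one is < 1% of the data,
    or has fewer than 500 reflections while also being < 10% of the data."""
    return (n_test < 0.01 * n_total
            or (n_test < 500 and n_test < 0.10 * n_total))

def new_test_fraction(n_possible):
    """At least 5%, raised (but capped at 10%) so that the test set
    reaches at least 1000 reflections."""
    return min(0.10, max(0.05, 1000 / n_possible))
```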

 

Funny side note: The random number generator in freerflag was set up to always 
pick the same test set for a given resolution and cell parameters, which is 
useful if you misplace your test set. Unfortunately, we also had data sets from 
the PDB where the newly generated test set had no observed reflections. Most of 
these datasets were close to 95% complete ;)

 

Cheers,

Robbie  

 




Re: [ccp4bb] How many is too many free reflections?

2015-06-02 Thread Folmer Fredslund
Hi Graeme,

There's a very nice page on the (unofficial?) CCP4 wiki about it
http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set

For structures with a lot of reflections, a rule of thumb would be that
2000 free reflections give adequate reliability in the free R
factor.

Hope this helps,
Folmer Fredslund






-- 
Folmer Fredslund


Re: [ccp4bb] How many is too many free reflections?

2015-06-02 Thread Pavel Afonine
Hi Graeme,

free reflections are used for at least two purposes: cross-validation
(calculation of Rfree) and ML parameter estimation (sigmaA or alpha/beta).
For the latter it is important that each relatively thin resolution bin
(sufficiently thin that alpha/beta can be considered constant within it)
receives no fewer than 50 reflections as an absolute minimum; in Phenix we
found that ~150 per bin is sufficient, and this is what's used by default.
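That rule turns into a lower bound on the size of the free set once the number of resolution bins is fixed (the 20-bin figure below is my assumption for illustration, not from the post):

```python
def min_free_reflections(n_bins=20, per_bin=150):
    """Smallest free set that still gives ~per_bin free reflections in
    each resolution bin for ML (sigmaA / alpha-beta) estimation."""
    return n_bins * per_bin

# With 20 bins and ~150 per bin you would want at least 3000 free
# reflections; with the 50-per-bin absolute minimum, at least 1000.
```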

Pavel




[ccp4bb] How many is too many free reflections?

2015-06-02 Thread Graeme Winter
Hi Folks

Had a vague comment handed my way that xia2 assigns too many free
reflections - I have a feeling that by default it makes a free set of 5%
which was OK back in the day (like I/sig(I) = 2 was OK) but maybe seems
excessive now.

This was particularly in the case of high resolution data where you have a
lot of reflections, so 5% could be several thousand which would be more
than you need to just check Rfree seems OK.

Since I really don't know what is the right # reflections to assign to a
free set thought I would ask here - what do you think? Essentially I need
to assign a minimum %age or minimum # - the lower of the two presumably?

Any comments welcome!

Thanks & best wishes Graeme


Re: [ccp4bb] How many is too many free reflections?

2015-06-02 Thread Isupov, Michail
Hi Graeme,

in a data set with just below 800,000 independent reflections I use 1% for 
freeR, which
is still an impressive 8,000. xia2 would have assigned 40,000 for freeR
at 5%. I think this is way too much.

Often we collect many data sets of the same project to find the better data.
We do use default xia2 FreeR assignments at this stage, and after locating the 
best
data set we cannot go back and reassign FreeR, as the new set will be biased 
towards
the model.
Referees/editors however query cases when over 5,000 reflections were used for 
cross-validation.

Misha


