Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling
No wikipedia link yet?

JPK

-----Original Message-----
From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Keller, Jacob
Sent: Thursday, January 22, 2015 5:20 PM
To: CCP4BB@JISCMAIL.AC.UK
Subject: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling

Dear Crystallographers,

This is more general than crystallography, but has applications therein, particularly in understanding fine phi-slicing.

The general question is: Given one needs to collect data to fit parameters for a known function, and given a limited total number of measurements, is it generally better to measure a small group of points multiple times or to distribute each individual measurement over the measurable extent of the function? I have a strong intuition that it is the latter, but all errors being equal, it would seem prima facie that both are equivalent.

For example, a line (y = mx + b) can be fit from two points. One could either measure the line at two points A and B five times each for a total of 10 independent measurements, or measure ten points evenly-spaced from A to B. Are these equivalent in terms of fitting and information content or not? Which is better? Again, conjecture and intuition suggest the evenly-spaced experiment is better, but I cannot formulate or prove to myself why, yet.

The application of this to crystallography might be another reason that fine phi-slicing (0.1 degrees * 3600 frames) is better than coarse (1 degree * 3600 frames), even though the number of times one measures reflections is tenfold higher in the second case (assuming no radiation damage). In the first case, one never measures the same phi angle twice, but one does have multiple measurements in a sense, i.e., of different parts of the same reflection.

Yes, 3D profile-fitting may be a big reason fine phi-slicing works, but beyond that, perhaps this sampling choice plays a role as well. Or maybe the profile-fitting works so well precisely because of this diffuse-single type of sampling rather than coarse-multiple sampling?

This general math/science concept must have been discussed somewhere--can anyone point to where?

JPK

***
Jacob Pearson Keller, PhD
Looger Lab/HHMI Janelia Research Campus
19700 Helix Dr, Ashburn, VA 20147
email: kell...@janelia.hhmi.org
***
Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling
On Friday, 23 January, 2015 20:36:08 Keller, Jacob wrote:
> No wikipedia link yet?

In practice the issue usually comes down to a consideration of the noise. Is the noise systematic? Is the signal/noise constant over time?

If the noise is systematic, then repeated measurement of the same data points will be biased by that systematic factor and will overestimate the accuracy of the data. This is an example of when precision is distinct from accuracy. On the other hand this protocol may allow you to better estimate and correct for a decrease in signal/noise as a function of time.

Conversely, spreading the measurement effort over a larger number of points may result in a larger apparent sigma (lower precision) but greater accuracy overall since the systematic effects are more likely to be correctly identified as noise.

Relating this back to fine-slicing on phi, if the noise has a large per-image component then fine-slicing is a bad idea because you increase the noise for no gain in signal. On the other hand if there is little or no per-image noise component, as is the case for photon counting detectors, then fine-slicing potentially decreases the noise because you do not have to estimate and subtract the background from the portion of the time the reflection of interest does not intersect the Ewald sphere.

Ethan

--
Ethan A Merritt
Biomolecular Structure Center, K-428 Health Sciences Bldg
MS 357742, University of Washington, Seattle 98195-7742
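The precision-versus-accuracy point above can be put into numbers for the 10-measurement line example from this thread. What follows is only a sketch under an assumed error model that is not in the original message: every distinct x position carries its own systematic offset (standard deviation tau) shared by all repeats taken there, on top of independent random noise (standard deviation sigma). The function name slope_sd and the chosen values of sigma and tau are hypothetical. For each design it computes the true standard deviation of the fitted slope and, approximately, the standard deviation an ordinary least-squares fit would report from its residuals.

```python
from math import sqrt

def slope_sd(x, sigma, tau):
    """Return (true_sd, reported_sd) of the least-squares slope.

    Assumed error model (illustrative only): each measurement has independent
    noise of SD sigma plus an offset of SD tau that is shared by all
    measurements taken at the same x position.
    """
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)

    def cov(i, j):
        # Covariance between measurements i and j under the assumed model.
        shared = tau ** 2 if x[i] == x[j] else 0.0
        return shared + (sigma ** 2 if i == j else 0.0)

    def hat(i, j):
        # Element of the hat (projection) matrix of the straight-line fit.
        return 1.0 / n + (x[i] - xbar) * (x[j] - xbar) / sxx

    pairs = [(i, j) for i in range(n) for j in range(n)]
    # The slope estimate is sum(w[i] * y[i]) with these weights.
    w = [(xi - xbar) / sxx for xi in x]
    true_var = sum(w[i] * w[j] * cov(i, j) for i, j in pairs)
    # Expected residual sum of squares, trace((I - H) V); the usual reported
    # error is built from this, so its square root is an approximation.
    expected_rss = sum(cov(i, i) for i in range(n)) - sum(
        hat(i, j) * cov(i, j) for i, j in pairs)
    reported_var = expected_rss / (n - 2) / sxx
    return sqrt(true_var), sqrt(reported_var)

even_spaced_x = list(range(1, 11))
heavy_ended_x = [1] * 5 + [10] * 5

for tau in (0.0, 1.0, 3.0):
    for name, x in (("even", even_spaced_x), ("heavy-ended", heavy_ended_x)):
        true_sd, reported_sd = slope_sd(x, sigma=1.0, tau=tau)
        print(f"tau={tau:.0f} {name:12s} true={true_sd:.3f} reported={reported_sd:.3f}")
```

Under this assumed model, with tau = 0 the heavy-ended design is the more precise one. With tau = 3 its reported slope error stays near 0.07 while its true error is about 0.48, whereas the evenly spaced design reports about 0.35 and is truly about 0.35: a larger apparent sigma, but an honest one.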
Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling
Hi Jacob,

My intuition for the line fit case was exactly the opposite to yours. My reasoning is a sort of physical one. If you imagine the line as a stiff rod with hooks for masses evenly spread along its length at 10 positions, then the object where you put 5 masses at positions 1 and 10 has a greater moment of inertia than the case where there is one mass at each position. This tells me that changes to the masses in the further spread case would be more effective (faster) at changing the orientation of the rod. Then it's a bit of a leap I admit, but I felt that measurements of a line's height made in such a way would be more effective at determining the fit parameters than the evenly spread case.

But I decided not to rely on intuition when I could simulate it. At the bottom of the message is a script in the R language that does two sets of line fits of the function y = m*x + c, where c = 100, m = 1, and x is the sequence 1..10 for the first set (fit1) and (1, 1, 1, 1, 1, 10, 10, 10, 10, 10) for the second set (fit2). The data being fit to are 'measurements' taken by adding a standard normal deviate to x + 100 at each position. After running all these simulations (it takes a few seconds) the script then calculates the mean and standard deviations of the fit parameters, m and c. The standard deviations are interesting:

mean intercept of fit1: 99.99619
sd of intercept of fit1 0.6833687

mean intercept of fit2: 99.99909
sd of intercept of fit2 0.498759

mean gradient of fit1: 1.00076
sd of gradient of fit1 0.1100296

mean gradient of fit2: 0.9996967
sd of gradient of fit2 0.07021113

Fit 2 has a tighter distribution for both the intercept and the gradient. It is therefore the more precise way of fitting the line, and this was the 'heavy-ended' case.
Script follows:

    # Simulate noisy measurements of y = 100 + x at the given x positions and
    # return the fitted (intercept, gradient) from a least-squares line fit.
    line_fit <- function(x_seq) {
      y_seq <- 100 + x_seq + rnorm(length(x_seq))
      return(coef(lm(y_seq ~ x_seq)))
    }

    even_spaced_x <- seq(1, 10)
    heavy_ended_x <- c(rep(1, 5), rep(10, 5))

    # NOTE: the replicate count is garbled in the archived message (it reads
    # "1", which cannot yield a standard deviation); 10000 is an assumed value.
    n_sims <- 10000
    fit1 <- replicate(n_sims, line_fit(even_spaced_x))
    fit2 <- replicate(n_sims, line_fit(heavy_ended_x))

    # Row 1 of each result matrix holds intercepts, row 2 holds gradients.
    cat("mean intercept of fit1:", mean(fit1[1,]), "\n")
    cat("sd of intercept of fit1", sd(fit1[1,]), "\n\n")
    cat("mean intercept of fit2:", mean(fit2[1,]), "\n")
    cat("sd of intercept of fit2", sd(fit2[1,]), "\n\n")
    cat("mean gradient of fit1:", mean(fit1[2,]), "\n")
    cat("sd of gradient of fit1", sd(fit1[2,]), "\n\n")
    cat("mean gradient of fit2:", mean(fit2[2,]), "\n")
    cat("sd of gradient of fit2", sd(fit2[2,]), "\n\n")

Cheers
-- David
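The simulated standard deviations quoted above can be cross-checked against the textbook closed-form results for an ordinary least-squares line fit with independent noise of standard deviation sigma: sd(gradient) = sigma / sqrt(Sxx) and sd(intercept) = sigma * sqrt(1/n + xbar^2 / Sxx), where Sxx is the sum of squared deviations of x from its mean. The sketch below is an illustration, not part of the original message; the function name ols_line_sds is hypothetical, and sigma = 1 is taken from the description of the simulation.

```python
from math import sqrt

def ols_line_sds(x, sigma=1.0):
    """Return (sd_intercept, sd_gradient) for a least-squares line fit.

    Assumes independent, equal-variance noise of SD sigma on every y value.
    """
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sd_gradient = sigma / sqrt(sxx)
    sd_intercept = sigma * sqrt(1.0 / n + xbar ** 2 / sxx)
    return sd_intercept, sd_gradient

even_spaced_x = list(range(1, 11))          # 1, 2, ..., 10
heavy_ended_x = [1] * 5 + [10] * 5          # five repeats at each end

for name, x in (("even-spaced", even_spaced_x), ("heavy-ended", heavy_ended_x)):
    sd_c, sd_m = ols_line_sds(x)
    print(f"{name:12s} sd(intercept)={sd_c:.3f} sd(gradient)={sd_m:.3f}")
# even-spaced: sd(intercept) ~ 0.683, sd(gradient) ~ 0.110
# heavy-ended: sd(intercept) ~ 0.499, sd(gradient) ~ 0.070
```

Both designs share the same mean x (5.5), so the whole difference comes from Sxx: 82.5 for the evenly spaced positions against 202.5 for the heavy-ended ones. That is the moment-of-inertia argument in algebraic form, and it holds only while the straight-line model and the independent-noise assumption are both correct.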
Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling
Hi Jacob,

It seems reasonable for your straight line case that the two would be equivalent. I would guess, though, that as the function to be fitted becomes more complex (e.g. a rectangular hyperbola in enzyme kinetics) it would be advantageous to measure more points of the independent variable fewer times each (assuming you measured each one enough to get a reasonable estimate of the experimental error).

Another reason I can think of to measure more points less frequently is so that you can detect possible deviations in the data from the expected function (there might be a peak in the middle of your straight line for some reason).

Looking forward to reading other responses!

Philip
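The point about detecting deviations from the expected function can be illustrated with the two 10-measurement designs from this thread. The sketch below is an illustration only, and the function name quadratic_gram_det is hypothetical. It asks whether a curvature term could be estimated at all, by computing the determinant of the normal-equations matrix X^T X for the quadratic model y = a + b*x + c*x^2. A zero determinant means the three parameters cannot be separated with that set of x positions.

```python
def quadratic_gram_det(x):
    """Determinant of X^T X for the model y = a + b*x + c*x**2.

    The columns of X are 1, x and x**2, so X^T X contains the power sums
    S0..S4 of the x positions. Integer x values keep the arithmetic exact.
    """
    s = [sum(xi ** k for xi in x) for k in range(5)]
    # X^T X = [[S0, S1, S2],
    #          [S1, S2, S3],
    #          [S2, S3, S4]], expanded along the first row.
    return (s[0] * (s[2] * s[4] - s[3] * s[3])
            - s[1] * (s[1] * s[4] - s[3] * s[2])
            + s[2] * (s[1] * s[3] - s[2] * s[2]))

even_spaced_x = list(range(1, 11))          # ten distinct positions
heavy_ended_x = [1] * 5 + [10] * 5          # only two distinct positions

print("even-spaced:", quadratic_gram_det(even_spaced_x))   # positive
print("heavy-ended:", quadratic_gram_det(heavy_ended_x))   # 0
```

With only two distinct x positions any curve through the two group means fits equally well, so a bump or curvature between A and B is invisible however many repeats are taken. The evenly spaced design gives up some precision on the straight-line parameters in exchange for the ability to notice that the line is the wrong model.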
Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling
Dear Jacob,

There are a multitude of directions you can go from here. A few comments spring to mind:

- are repeated observations of the same reflection on the same part of the detector really independent?
- for the wide phi sliced vs. fine sliced discussion the detector type matters a great deal: a CCD has one set of properties, a PAD another
- are we limited to megabytes? why not do both? i.e. 36,000 0.1 degree images? or if time limited are 360 1 degree images good enough?!
- if you have a kappa goniostat, are you better off re-orientating the sample between scans (so you are not repeating measurements in your graph fitting analogy, but you are getting [more] independent measurements) or keeping the sample orientation the same, so you are sampling the source signal (incoming X-ray beam) better? The former probably allows better scaling, the latter allows clearer radiation damage analysis
- if you do have radiation damage (which clearly we do, even if it is only small radiation-induced changes) then which strategy would best allow you to go back retrospectively and truncate your data set? your 10-spin approach wins out here...

A really interesting question comes from your comment about the information content of the data: really, what information is in there? Are we talking about the absolute amount of information or just the amount of information we currently extract and use?

Profile fitting wise, the ideal should (in principle) be to have a very small number of events on every image (exactly one?) from which you could recover the perfect time series of events, have a relatively exact orientation for every one, and from this know or be able to assign whether each event comes from background or signal (or, from a different point of view, signal from protein / signal from solvent / signal from air), or more likely the probability that event E has class S. Only problem with this is that it does not work (yet!)
From a practical point of view, I tend to like taking weak data many times from crystals when using a pixel array detector, as the overhead / cost is small and the option to revisit the data retrospectively or stop collecting data is there. This is expensive in MB but these are cheap: given the choices above I would tend to select 10 x 3600 x 0.1 degree [not one of your two options I acknowledge] ;o) but that is just gut instinct. There are radiation damage arguments for doing the exact opposite too...

Your final question, on the general problem: I suspect that statistically it has been well treated, but the fact that we record a convolution of reciprocal space with lots of other stuff (background, readout noise, source signal, ...) may make mapping MX data collection to this treatment a study in its own right.

I look forward to reading the rest of this thread!

Best wishes
Graeme
[ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling
Dear Crystallographers,

This is more general than crystallography, but has applications therein, particularly in understanding fine phi-slicing.

The general question is: Given one needs to collect data to fit parameters for a known function, and given a limited total number of measurements, is it generally better to measure a small group of points multiple times or to distribute each individual measurement over the measurable extent of the function? I have a strong intuition that it is the latter, but all errors being equal, it would seem prima facie that both are equivalent.

For example, a line (y = mx + b) can be fit from two points. One could either measure the line at two points A and B five times each for a total of 10 independent measurements, or measure ten points evenly-spaced from A to B. Are these equivalent in terms of fitting and information content or not? Which is better? Again, conjecture and intuition suggest the evenly-spaced experiment is better, but I cannot formulate or prove to myself why, yet.

The application of this to crystallography might be another reason that fine phi-slicing (0.1 degrees * 3600 frames) is better than coarse (1 degree * 3600 frames), even though the number of times one measures reflections is tenfold higher in the second case (assuming no radiation damage). In the first case, one never measures the same phi angle twice, but one does have multiple measurements in a sense, i.e., of different parts of the same reflection.

Yes, 3D profile-fitting may be a big reason fine phi-slicing works, but beyond that, perhaps this sampling choice plays a role as well. Or maybe the profile-fitting works so well precisely because of this diffuse-single type of sampling rather than coarse-multiple sampling?

This general math/science concept must have been discussed somewhere--can anyone point to where?

JPK

***
Jacob Pearson Keller, PhD
Looger Lab/HHMI Janelia Research Campus
19700 Helix Dr, Ashburn, VA 20147
email: kell...@janelia.hhmi.org
***