Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling

2015-01-23 Thread Keller, Jacob
No wikipedia link yet?

JPK


-Original Message-
From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Keller, 
Jacob
Sent: Thursday, January 22, 2015 5:20 PM
To: CCP4BB@JISCMAIL.AC.UK
Subject: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling

Dear Crystallographers,

This is more general than crystallography, but has applications therein, 
particularly in understanding fine phi-slicing.

The general question is:

Given one needs to collect data to fit parameters for a known function, and 
given a limited total number of measurements, is it generally better to measure 
a small group of points multiple times or to distribute each individual 
measurement over the measurable extent of the function? I have a strong 
intuition that it is the latter, but all errors being equal, it would seem 
prima facie that both are equivalent. For example, a line (y = mx + b) can be 
fit from two points. One could either measure the line at two points A and B 
five times each for a total of 10 independent measurements, or measure ten 
points evenly-spaced from A to B. Are these equivalent in terms of fitting and 
information content or not? Which is better? Again, conjecture and intuition 
suggest the evenly-spaced experiment is better, but I cannot formulate or prove 
to myself why, yet.

The application of this to crystallography might be another reason that fine 
phi-slicing (0.1 degrees * 3600 frames) is better than coarse (1 degree * 3600 
frames), even though the number of times one measures reflections is tenfold 
higher in the second case (assuming no radiation damage). In the first case, 
one never measures the same phi angle twice, but one does have multiple 
measurements in a sense, i.e., of different parts of the same reflection.

Yes, 3D profile-fitting may be a big reason fine phi-slicing works, but beyond 
that, perhaps this sampling choice plays a role as well. Or maybe the 
profile-fitting works so well precisely because of this diffuse-single type of 
sampling rather than coarse-multiple sampling?

This general math/science concept must have been discussed somewhere--can 
anyone point to where?

JPK

***
Jacob Pearson Keller, PhD
Looger Lab/HHMI Janelia Research Campus
19700 Helix Dr, Ashburn, VA 20147
email: kell...@janelia.hhmi.org
***


Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling

2015-01-23 Thread Ethan A Merritt
On Friday, 23 January, 2015 20:36:08 Keller, Jacob wrote:
> No wikipedia link yet?

In practice the issue usually comes down to a consideration of the noise.
Is the noise systematic?  Is the signal/noise constant over time?  

If the noise is systematic, then repeated measurement of the same
data points will be biased by that systematic factor and will
overestimate the accuracy of the data.  This is an example of when
precision is distinct from accuracy. On the other hand this
protocol may allow you to better estimate and correct for a decrease
in signal/noise as a function of time.

Conversely spreading the measurement effort over a larger number
of points may result in a larger apparent sigma (lower precision)
but greater accuracy overall since the systematic effects are
more likely to be correctly identified as noise.
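
[Editorial sketch, not part of the original message.] The contrast between precision and accuracy described above can be illustrated with a small R simulation. It rests on an assumed noise model that the thread does not specify: each distinct x position carries its own systematic offset, shared by every repeat measurement taken at that position. The function name sim_fit and the parameters sd_sys and sd_rand are hypothetical.

```r
set.seed(1)

# Assumed noise model (not from the thread): every distinct x position has its
# own systematic offset, shared by all repeat measurements taken there.
sim_fit <- function(x_seq, sd_sys = 1, sd_rand = 1) {
  pos <- unique(x_seq)
  offsets <- rnorm(length(pos), sd = sd_sys)      # one offset per position
  y_seq <- 100 + x_seq + offsets[match(x_seq, pos)] +
    rnorm(length(x_seq), sd = sd_rand)            # plus random noise per measurement
  est <- summary(lm(y_seq ~ x_seq))$coefficients
  c(gradient = unname(est[2, "Estimate"]),
    reported_se = unname(est[2, "Std. Error"]))
}

n_rep <- 2000
even  <- replicate(n_rep, sim_fit(seq(1, 10)))
heavy <- replicate(n_rep, sim_fit(c(rep(1, 5), rep(10, 5))))

# Compare the actual scatter of the fitted gradient with the standard error
# that each individual fit reports for itself.
summarise <- function(fits) {
  c(actual_sd = sd(fits["gradient", ]),
    mean_reported_se = mean(fits["reported_se", ]))
}
result <- rbind(even = summarise(even), heavy = summarise(heavy))
print(round(result, 3))
```

Under these assumptions the repeated-points design should report a standard error well below the actual scatter of its gradient (the fit sees only the within-position noise), whereas the evenly spread design reports a larger standard error that is close to its true scatter.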

Relating this back to fine-slicing on phi, if the noise has a large
per-image component then fine-slicing is a bad idea because you 
increase the noise for no gain in signal.  On the other hand if
there is little or no per-image noise component, as is the case
for photon counting detectors, then fine-slicing potentially
decreases the noise because you do not have to estimate and subtract
the background from the portion of the time the reflection of 
interest does not intersect the Ewald sphere.
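
[Editorial sketch, not part of the original message.] The trade-off above can be put into numbers with a deliberately simplified counting model. Everything in it is an assumption made for illustration: the signal and background counts, the reflection width of 0.25 degrees, the per-image noise variance, and the function name intensity_sd. It also ignores partial reflections and the error in the background estimate itself.

```r
# Simplified counting model; all numbers below are illustrative assumptions.
# A reflection of angular width refl_width_deg is integrated over the images
# it touches, and each image adds its own noise variance read_var.
intensity_sd <- function(slice_deg, signal = 1000, bg_per_deg = 200,
                         read_var = 0, refl_width_deg = 0.25) {
  n_img <- ceiling(refl_width_deg / slice_deg)   # images spanned by the reflection
  background <- bg_per_deg * n_img * slice_deg   # background counts under the peak
  sqrt(signal + background + n_img * read_var)   # Poisson terms plus per-image noise
}

# No per-image noise, as assumed for a photon counting detector
photon_counting <- c(coarse = intensity_sd(1.0, read_var = 0),
                     fine   = intensity_sd(0.1, read_var = 0))

# A large per-image noise component
per_image_noise <- c(coarse = intensity_sd(1.0, read_var = 400),
                     fine   = intensity_sd(0.1, read_var = 400))

print(round(rbind(photon_counting, per_image_noise), 1))
```

In this toy model fine slicing lowers the estimated sd when read_var is zero, because less background is integrated under the peak, and raises it when every image contributes its own noise.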

Ethan

 
-- 
Ethan A Merritt
Biomolecular Structure Center,  K-428 Health Sciences Bldg
MS 357742,   University of Washington, Seattle 98195-7742


Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling

2015-01-23 Thread David Waterman
Hi Jacob,

My intuition for the line fit case was exactly the opposite to yours. My
reasoning is a sort of physical one. If you imagine the line as a stiff rod
with hooks for masses evenly spread along its length at 10 positions, then
the object where you put 5 masses at positions 1 and 10 has a greater
moment of inertia than the case where there is one mass at each position.
This tells me that changes to the masses in the further spread case would
be more effective (faster) at changing the orientation of the rod. Then
it's a bit of a leap I admit, but I felt that measurements of a line's height
made in such a way would be more effective at determining the fit
parameters than the evenly spread case.
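
[Editorial note, not part of the original message.] The moment-of-inertia picture corresponds to a standard ordinary least-squares result, stated here under the assumption of independent errors with equal variance sigma^2: the variance of the fitted gradient is inversely proportional to the spread of the x positions about their mean, which is the moment of inertia of unit masses placed at those positions.

```latex
\operatorname{Var}(\hat m) = \frac{\sigma^2}{S_{xx}},
\qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar x)^2

% Evenly spread, x = 1, 2, \dots, 10, \bar x = 5.5:
S_{xx} = 82.5
\quad\Rightarrow\quad
\operatorname{sd}(\hat m) = \frac{\sigma}{\sqrt{82.5}} \approx 0.110\,\sigma

% Heavy-ended, five points at x = 1 and five at x = 10, \bar x = 5.5:
S_{xx} = 10 \times 4.5^2 = 202.5
\quad\Rightarrow\quad
\operatorname{sd}(\hat m) = \frac{\sigma}{\sqrt{202.5}} \approx 0.070\,\sigma
```

With sigma = 1 these values agree with the simulated standard deviations of the gradient reported further down in this message.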

But I decided not to rely on intuition when I could simulate it. At the
bottom of the message is a script in the R language that does two sets of
line fits of the function y = m*x + c, where c = 100, m = 1 and x is the
sequence 1..10 for the first set and (1, 1, 1, 1, 1, 10, 10, 10, 10, 10)
for the second. The data being fit to are 'measurements' taken by
adding a standard normal deviate to x + 100 at each position.

After running all these simulations (it takes a few seconds) the script
then calculates the mean and standard deviations of the fit parameters, m
and c. The standard deviations are interesting:

mean intercept of fit1: 99.99619
sd of intercept of fit1 0.6833687

mean intercept of fit2: 99.99909
sd of intercept of fit2 0.498759

mean gradient of fit1: 1.00076
sd of gradient of fit1 0.1100296

mean gradient of fit2: 0.9996967
sd of gradient of fit2 0.07021113

Fit 2 has a tighter distribution for both the intercept and the gradient.
It is therefore the more precise way of fitting the line, and this was the
'heavy-ended' case.

Script follows:

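# Fit a straight line to noisy 'measurements' of y = 100 + x taken at the
# positions in x_seq; returns the fitted (intercept, gradient).
line_fit <- function(x_seq)
{
  y_seq <- 100 + x_seq + rnorm(length(x_seq))
  return(coef(lm(y_seq~x_seq)))
}

even_spaced_x <- seq(1,10)
heavy_ended_x <- c(rep(1,5), rep(10,5))

# The number of replicates is garbled in the archived message; 10000 is an
# assumed value, chosen only so that the standard deviations are well defined.
n_rep <- 10000

# Each result is a 2 x n_rep matrix: row 1 = intercept, row 2 = gradient
fit1 <- replicate(n_rep, line_fit(even_spaced_x))
fit2 <- replicate(n_rep, line_fit(heavy_ended_x))

cat("mean intercept of fit1:", mean(fit1[1,]), "\n")
cat("sd of intercept of fit1", sd(fit1[1,]), "\n\n")

cat("mean intercept of fit2:", mean(fit2[1,]), "\n")
cat("sd of intercept of fit2", sd(fit2[1,]), "\n\n")

cat("mean gradient of fit1:", mean(fit1[2,]), "\n")
cat("sd of gradient of fit1", sd(fit1[2,]), "\n\n")

cat("mean gradient of fit2:", mean(fit2[2,]), "\n")
cat("sd of gradient of fit2", sd(fit2[2,]), "\n\n")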
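
[Editorial sketch, not part of the original message.] The simulated spreads can be cross-checked without any random numbers. For ordinary least squares with independent errors of standard deviation sigma, the standard deviations of the fitted parameters are sigma times the square roots of the diagonal of (X'X)^-1, where X is the design matrix. The helper name predicted_sd is hypothetical.

```r
# Predicted sd of (intercept, gradient) for a straight-line fit with
# independent errors of standard deviation sigma: sigma * sqrt(diag((X'X)^-1))
predicted_sd <- function(x_seq, sigma = 1) {
  X <- cbind(1, x_seq)                       # design matrix for y = c + m*x
  sigma * sqrt(diag(solve(t(X) %*% X)))
}

sd_even  <- predicted_sd(seq(1, 10))
sd_heavy <- predicted_sd(c(rep(1, 5), rep(10, 5)))

predicted <- rbind(even = sd_even, heavy = sd_heavy)
colnames(predicted) <- c("intercept", "gradient")
print(round(predicted, 4))
```

The predicted values, about 0.683 and 0.110 for the evenly spaced design and about 0.499 and 0.070 for the heavy-ended design, are close to the simulated standard deviations quoted in the message.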


Cheers

-- David




Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling

2015-01-23 Thread Philip Kiser
Hi Jacob,

It seems reasonable for your straight line case that the two would be
equivalent. I would guess though that as the function to be fitted against
becomes more complex (e.g. a rectangular hyperbola in enzyme kinetics) it
would be advantageous to have more points of the independent variable
measured fewer times each (assuming you measured each one enough to
get a reasonable estimate of the experimental error). Another reason I can
think of to measure more points less frequently is so you can detect
possible deviations in the data from the expected function (there might be
a peak in the middle of your straight line for some reason). Looking
forward to reading other responses!
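
[Editorial sketch, not part of the original message.] The point about detecting deviations from the expected function can be made concrete in R. The example assumes a gentle curvature rather than a peak, with a made-up coefficient of 0.1 and noise sd of 0.1; the function name curved is hypothetical. With only two distinct x positions the quadratic term is exactly collinear with the intercept and the linear term, so lm cannot estimate it at all.

```r
set.seed(2)

# Hypothetical 'true' function: the expected line plus a small curvature
curved <- function(x_seq) {
  100 + x_seq + 0.1 * (x_seq - 5.5)^2 + rnorm(length(x_seq), sd = 0.1)
}

even_x  <- seq(1, 10)
heavy_x <- c(rep(1, 5), rep(10, 5))
y_even  <- curved(even_x)
y_heavy <- curved(heavy_x)

# Try to fit a quadratic term with each design
fit_even  <- lm(y_even ~ even_x + I(even_x^2))
fit_heavy <- lm(y_heavy ~ heavy_x + I(heavy_x^2))

print(coef(fit_even))    # curvature is estimable from ten distinct positions
print(coef(fit_heavy))   # quadratic coefficient is NA: two positions cannot see it
```

The evenly spread design recovers a quadratic coefficient near the assumed 0.1, while the two-position design gives no way to tell the curved function from a straight line.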

Philip




Re: [ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling

2015-01-23 Thread Graeme Winter
Dear Jacob,

There are a multitude of directions you can go from here - a few comments
spring to mind:

 - are repeated observations of the same reflection on the same part of the
detector really independent?
 - for the wide phi sliced vs. fine sliced discussion the detector type
matters a great deal - a CCD has one set of properties, a PAD another
 - are we limited to megabytes? why not do both? i.e. 36,000 0.1 degree
images? or if time limited are 360 1 degree images good enough?!
 - if you have a kappa goniostat are you better off re-orientating the
sample between scans (so you are not repeating measurements in your graph
fitting analogy) but you are getting [more] independent measurements or
keeping the sample orientation the same so you are sampling the source
signal (incoming X-ray beam) better? former probably allows better scaling,
latter allows clearer radiation damage analysis
 - if you do have radiation damage (which clearly we do, even if it is
radiation induced small changes) then which strategy would best allow you
to go back and retrospectively truncate your data set? your 10-spin approach
wins out here...

A really interesting question comes from your comment about the information
content of the data - really what information is in there - are we talking
about the absolute amount of information or just the amount of information
we currently extract and use?

Profile fitting wise the ideal should (in principle) be to have a very
small number of events on every image (exactly one?) from which you could
recover the perfect time series of events and have a relatively exact
orientation for every one, and from this know or be able to assign whether
each event comes from background or signal or, from a different point of
view, signal from protein, signal from solvent or signal from air (or
more likely the probability that event E has class S). Only problem with
this is that it does not work (yet!)

From a practical point of view, I tend to like taking weak data many times
from crystals when using a pixel array detector as the overhead / cost is
small and the option to revisit the data and retrospectively stop collecting
data is there. This is expensive in MB but these are cheap: given the
choices above I would tend to select 10 x 3600 x 0.1 degree [not one of
your two options I acknowledge] ;o) but that is just gut instinct. There
are radiation damage arguments for doing the exact opposite too...

Your final question, on the general problem - I suspect that statistically
it has been well treated, but the fact that we record a convolution of
reciprocal space with lots of other stuff (background, readout noise,
source signal, ...) may make mapping MX data collection to this treatment a
study in its own right.

I look forward to reading the rest of this thread!

Best wishes Graeme









[ccp4bb] Continuous-Single Versus Coarse-Multiple Sampling

2015-01-22 Thread Keller, Jacob
Dear Crystallographers,

This is more general than crystallography, but has applications therein, 
particularly in understanding fine phi-slicing.

The general question is:

Given one needs to collect data to fit parameters for a known function, and 
given a limited total number of measurements, is it generally better to measure 
a small group of points multiple times or to distribute each individual 
measurement over the measurable extent of the function? I have a strong 
intuition that it is the latter, but all errors being equal, it would seem 
prima facie that both are equivalent. For example, a line (y = mx + b) can be 
fit from two points. One could either measure the line at two points A and B 
five times each for a total of 10 independent measurements, or measure ten 
points evenly-spaced from A to B. Are these equivalent in terms of fitting and 
information content or not? Which is better? Again, conjecture and intuition 
suggest the evenly-spaced experiment is better, but I cannot formulate or prove 
to myself why, yet.

The application of this to crystallography might be another reason that fine 
phi-slicing (0.1 degrees * 3600 frames) is better than coarse (1 degree * 3600 
frames), even though the number of times one measures reflections is tenfold 
higher in the second case (assuming no radiation damage). In the first case, 
one never measures the same phi angle twice, but one does have multiple 
measurements in a sense, i.e., of different parts of the same reflection.

Yes, 3D profile-fitting may be a big reason fine phi-slicing works, but beyond 
that, perhaps this sampling choice plays a role as well. Or maybe the 
profile-fitting works so well precisely because of this diffuse-single type of 
sampling rather than coarse-multiple sampling?

This general math/science concept must have been discussed somewhere--can 
anyone point to where?

JPK

***
Jacob Pearson Keller, PhD
Looger Lab/HHMI Janelia Research Campus
19700 Helix Dr, Ashburn, VA 20147
email: kell...@janelia.hhmi.org
***