Thank you for the comments. They are exactly the concerns I had when I
first encountered the problem. It seems that without further assumptions,
either on the distributions or on the weights (sizes) of the samples, I
cannot use any parametric method to approach the problem.

However, the weights are something I am trying to derive, so the
distribution function is the only thing left to manipulate. The "large
sample" I have appears to be Poisson (the data can be treated as either
discrete or continuous; by nature it is discrete, because it is a count
of days). If I can make a further assumption about the distribution of
the unobserved sample, then I think a finite mixture model will take
care of the estimation.
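
To make the idea concrete, here is a minimal sketch of what I have in
mind. It assumes (and these are only assumptions) that both
subpopulations are Poisson, that the rate of the known one is taken
from the "large sample", and that zero counts are not truncated. The
function names and the starting value are my own choices for
illustration, not an established routine.

```python
import math


def poisson_pmf(k, lam):
    # Computed on the log scale to avoid overflow in k! for larger counts.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def fit_mixture_known_component(counts, lam_known, n_iter=200):
    """EM for w * Poisson(lam_known) + (1 - w) * Poisson(lam_other).

    lam_known is held fixed (assumed to come from the large sample);
    only the weight w and the other rate lam_other are estimated.
    """
    n = len(counts)
    mean_all = sum(counts) / n
    w = 0.5
    # Moment-based start: with w = 0.5, mean_all = (lam_known + lam_other) / 2.
    lam_other = max(2.0 * mean_all - lam_known, 0.5)
    for _ in range(n_iter):
        # E-step: probability that each count came from the known component.
        resp = []
        for k in counts:
            a = w * poisson_pmf(k, lam_known)
            b = (1.0 - w) * poisson_pmf(k, lam_other)
            resp.append(a / (a + b))
        # M-step: update the weight and the unknown component's rate.
        w = sum(resp) / n
        other_total = sum(1.0 - r for r in resp)
        lam_other = sum((1.0 - r) * k for r, k in zip(resp, counts)) / other_total
    return w, lam_other
```

Under the Poisson assumption the fitted lam_other would serve as both
the mean and the variance of the unobserved subpopulation.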

Without these assumptions, I plan to use a clustering method to
separate the bimodal sample into two. This will certainly lose some
information and may misclassify some points since, as you mentioned in
4), the two distributions overlap (are "superimposed") over the whole
range. It also seems that this approach needs nothing from the other
"large sample".
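
As a rough sketch of the kind of clustering I mean, the simplest
version would be a one-dimensional two-means split: find a cut point
and assign everything at or below it to the lower cluster. The function
name and the min/max starting centres are my own illustrative choices;
any other clustering method could be substituted.

```python
def two_means_1d(values, n_iter=100):
    """Split 1-D data into a low and a high cluster (2-means, Lloyd style)."""
    # Start the two centres at the extremes of the data.
    lo, hi = float(min(values)), float(max(values))
    for _ in range(n_iter):
        cut = (lo + hi) / 2.0
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        if not high:
            # All values are identical, so there is nothing to split.
            break
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            # Centres stopped moving: converged.
            break
        lo, hi = new_lo, new_hi
    cut = (lo + hi) / 2.0
    low = [v for v in values if v <= cut]
    high = [v for v in values if v > cut]
    return cut, low, high
```

Because the cut is a hard boundary, points from the overlapping region
end up in whichever cluster is nearer, which is exactly the
misclassification I am worried about.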

On the other hand, given the additional information from the "large
sample", can I somehow adjust the clustering process to make it
perform better?
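
One possibility, following your point 3) in reverse: if a mixing weight
can be pinned down somehow (for instance, the share of points a
clustering step assigns to the known subpopulation), then the mean and
variance of the other subpopulation follow from the pooled moments and
the moments of the "large sample". The sketch below is only that; the
function and argument names are hypothetical, the weight is an assumed
input, and the variances are taken to be population (divide-by-n)
variances.

```python
def other_component_moments(mean_all, var_all, mean_known, var_known, w_known):
    """Back out (mean, variance) of the unobserved component.

    Uses  mean_all   = w * mean_known   + (1 - w) * mean_other
    and   E[X^2]_all = w * E[X^2]_known + (1 - w) * E[X^2]_other,
    where w = w_known is the assumed share of the known component.
    """
    if not 0.0 < w_known < 1.0:
        raise ValueError("w_known must lie strictly between 0 and 1")
    w_other = 1.0 - w_known
    mean_other = (mean_all - w_known * mean_known) / w_other
    # Convert variances to second raw moments, which mix linearly.
    ex2_all = var_all + mean_all ** 2
    ex2_known = var_known + mean_known ** 2
    ex2_other = (ex2_all - w_known * ex2_known) / w_other
    var_other = ex2_other - mean_other ** 2
    return mean_other, var_other
```

A negative var_other would be a sign that the assumed weight, or the
assumption that the "large sample" matches one of the subpopulations,
does not hold for the data.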

James


[EMAIL PROTECTED] (Donald Burrill) wrote in message news:<[EMAIL PROTECTED]>...
> On 4 Mar 2003, James wrote (edited):
> 
> > I have a dataset (positive count data) with bimodal shape.
> >  My theory is that the sample is composed of samples drawn from two
> > different populations.  I have another large dataset that I assume is
> > a sample of one of the subpopulation.  The data again is counts of
> > occurrences (all positive and positively skewed, I wouldn't want to
> > assume any distributions at this time).
> 
> By "all positive", do you mean only "no zero counts", or "no values of
> zero in the data"?  If the latter, is that also true of your bimodal
> data set?  Is it reasonable that no zero values had been observed, or
> may such values have been excised from the data (with or without malice
> aforethought)?
> 
> > My question is: can I derive some information about the other
> > subpopulation like mean and variance without further assumptions on
> > distributions of the populations or weights of how the bimodal
> > sample drawn from the two populations?
> 
> I do not see how to do this easily.
>  1)  Your data are counts.  Unless you have reduced them to, say,
> proportions, the mean and variance will depend on the sample size.
> Some such reduction would be necessary, I suppose, even to compare
> values from your "other large dataset" with those from your bimodal
> sample.
>  2)  Do you have any idea what might induce bimodality (e.g., the sample
> contains data from males and females, and it is reasonable to observe a
> systematic difference in mean number (or proportion) of counts for this
> variable), and can you segregate the data on the basis of such an
> identifier?  (One suspects not, or you'd have mentioned it:  the problem
> would be much easier from that approach.)
>  3)  From (mean, variance, sample size) for two subsamples one can
> readily find (mean, variance) for the data set obtained by combining the
> two.  To do it the other way round, you still need the two sample sizes:
> so you'd need to guess the ratio of sample sizes (which is what I
> suppose you meant by referring to "weights").
>  4)  It would be possible to make a start if you thought you knew the
> means of both subpopulations (and the variance of one of them, as you
> claim to have).  But so far as I can see, your only information about
> different means arises from observing different modes;  and you cannot
> estimate a mean from a mode (quite apart from the unreliability of modes
> as measures, in general!) without assuming something about the shape(s)
> of the distribution(s).  This *might* be possible, given your second
> dataset and assuming that the two distributions you posit were similar
> in this regard.  But then there's the problem that the modes you can
> observe are influenced by the presence of the other subsample:  in
> particular, given the skewness you mention, the upper mode may have been
> shifted somewhat to the right, due to its distribution's having been
> superimposed (so to speak) on a sloping surface.
>  [As you know, everyone is always looking for a level playing field...]
> 
> I don't know if these thoughts will have been helpful.  Good luck!
>  -----------------------------------------------------------------------
>  Donald F. Burrill                                            [EMAIL PROTECTED]
>  56 Sebbins Pond Drive, Bedford, NH 03110                 (603) 626-0816
> 
> .
> .
> =================================================================
> Instructions for joining and leaving this list, remarks about the
> problem of INAPPROPRIATE MESSAGES, and archives are available at:
> .                  http://jse.stat.ncsu.edu/                    .
> =================================================================
