Why would anyone want to fit a mixture of normals to 110 million observations? Any question about the distribution that you would care to ask can be answered directly from the data. And of course, with a sample that size, any test of normality (or of anything else) would be rejected.
More to the point, the data are certainly not a random sample of anything. There will be all kinds of systematic nonrandom structure in them. This is clearly a situation where the researcher needs to think more carefully about the substantive questions of interest and how the data may shed light on them, instead of arbitrarily, and perhaps reflexively, throwing some silly statistical methodology at them.

Bert Gunter
Genentech Nonclinical Statistics

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Tim Victor
Sent: Tuesday, August 07, 2007 3:02 PM
To: r-help@stat.math.ethz.ch
Subject: Re: [R] Mixture of Normals with Large Data

I wasn't aware of this literature; thanks for the references.

On 8/5/07, RAVI VARADHAN <[EMAIL PROTECTED]> wrote:
> Another possibility is to use "data squashing" methods. Relevant papers
> are: (1) DuMouchel et al. (1999), (2) Madigan et al. (2002), and (3)
> Owen (1999).
>
> Ravi.
>
> Ravi Varadhan, Ph.D.
> Assistant Professor, Division of Geriatric Medicine and Gerontology
> School of Medicine, Johns Hopkins University
> Ph. (410) 502-2619
> email: [EMAIL PROTECTED]
>
> ----- Original Message -----
> From: "Charles C. Berry" <[EMAIL PROTECTED]>
> Date: Saturday, August 4, 2007 8:01 pm
> Subject: Re: [R] Mixture of Normals with Large Data
> To: [EMAIL PROTECTED]
> Cc: r-help@stat.math.ethz.ch
>
> > On Sat, 4 Aug 2007, Tim Victor wrote:
> >
> > > All:
> > >
> > > I am trying to fit a mixture of 2 normals with 110 million
> > > observations. I am running R 2.5.1 on a box with 1 GB of RAM running
> > > 32-bit Windows, and I keep running out of memory. Does anyone have
> > > any suggestions?
> >
> > If the first few million observations can be regarded as an SRS of the
> > rest, then just use them. Or read in blocks of a convenient size and
> > sample some observations from each block. You can repeat this process
> > a few times to see whether the results are sufficiently accurate.
> >
> > Otherwise, read in blocks of a convenient size (perhaps 1 million
> > observations at a time), quantize the data to a manageable number of
> > intervals - maybe a few thousand - and tabulate it. Add the counts
> > over all the blocks.
> >
> > Then use mle() to fit a multinomial likelihood whose probabilities are
> > the masses associated with each bin under a mixture-of-normals law.
> >
> > Chuck
> >
> > > Thanks so much,
> > >
> > > Tim
> >
> > Charles C. Berry                 (858) 534-2098
> > Dept of Family/Preventive Medicine
> > UC San Diego
> > La Jolla, San Diego 92093-0901
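For concreteness, here is a rough sketch in R of Chuck's block-sampling suggestion. The file name values.txt, the block size, and the per-block subsample size are illustrative assumptions, not details from the thread:

## Read 1e6 values at a time from a (hypothetical) one-column text file
## and keep a simple random subsample of each block.
con <- file("values.txt", open = "r")   # "values.txt" is a placeholder name
keep <- numeric(0)
repeat {
  block <- scan(con, what = numeric(0), nmax = 1e6, quiet = TRUE)
  if (length(block) == 0) break         # end of file
  keep <- c(keep, sample(block, size = min(1e4, length(block))))
}
close(con)
## 'keep' now fits comfortably in memory; fit the two-component mixture
## to it, and repeat with fresh subsamples to check stability.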
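And a sketch of the binned multinomial fit Chuck outlines, using mle() from the stats4 package. The toy data, number of bins, and starting values are illustrative; with the real data, 'breaks' would be fixed in advance and 'counts' accumulated block by block as above:

library(stats4)

## Toy stand-in for the real accumulated table.
set.seed(1)
x      <- c(rnorm(5e4, 0, 1), rnorm(5e4, 3, 1))
breaks <- seq(min(x), max(x), length.out = 2001)   # a few thousand bins
counts <- tabulate(cut(x, breaks, labels = FALSE, include.lowest = TRUE),
                   nbins = length(breaks) - 1)

## Negative log-likelihood: multinomial counts with bin masses given by
## the two-component normal mixture.
nll <- function(p, mu1, mu2, log.s1, log.s2) {
  w  <- plogis(p)                       # mixing weight in (0, 1)
  s1 <- exp(log.s1); s2 <- exp(log.s2)  # standard deviations kept positive
  mass <- w       * diff(pnorm(breaks, mu1, s1)) +
          (1 - w) * diff(pnorm(breaks, mu2, s2))
  -sum(counts * log(mass + 1e-300))     # tiny constant guards against log(0)
}

fit <- mle(nll, start = list(p = 0, mu1 = -1, mu2 = 4,
                             log.s1 = 0, log.s2 = 0))
summary(fit)

Reparameterizing through plogis() and exp() keeps the mixing weight in (0, 1) and the standard deviations positive, so mle()'s default unconstrained BFGS optimizer can be used directly.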