srinivas wrote:
>
> Hi,
>
> I have a problem identifying the right multivariate tools to
> handle a dataset of dimension 100,000 x 500. The problem is further
> complicated by a lot of missing data. Can anyone suggest a way to
> reduce the data set and also to estimate the missing values? I need
> to know which clustering tool is appropriate for grouping the
> observations (based on the 500 variables).
This may not be the answer to your question, but clearly you need a
good statistical package that will let you manipulate the data in
ways that make sense and devise simplification strategies appropriate
to the context. I recently went through a similar exercise, smaller
than yours but still complex: approximately 5,000 cases by 65
variables. I used the statistical package R, and I can tell you it
was a godsend.

In previous incarnations (more than 10 years ago) I had at various
times used BMDP, SAS, SPSS, and S (which of them varied with
employer). I liked S best of the lot because of the advantages I
found in the Unix environment. Nowadays I have Linux on the desktop,
and I looked for the package closest to S in spirit, which turned out
to be R. That it is free was a bonus. That it is a fully extensible
programming language in its own right gave me everything I needed, as
I tend to "roll my own" when I do statistical analysis, combining
elements of possibilistic analysis of the likelihood function derived
from fuzzy set theory.

At any rate, if that was indeed your question, and if you're on a
tight budget, I would say get a Linux box (a fast one, with lots of
RAM and hard-disk space), download a copy of R, and start with the
graphing tools that let you "look at" the data as a first step.
Sensible ways of grouping and simplifying will suggest themselves,
and inevitably thereafter you'll want to fit some regression models
and/or do some analysis of variance. If you're *not* on a tight
budget, and/or you have access to a fancy workstation, then you might
also have your choice of expensive stats packages. If I were you, I
would still opt for R, essentially because of its programmability,
which in my recent work I found indispensable.
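
As a concrete illustration of that first pass in base R (this is only
a sketch, not a recipe: the file name "data.csv", the crude median
imputation, the ten retained components, and the five clusters are
all placeholders you would tune to your own data):

```r
## Sketch: inspect missingness, impute crudely, reduce, then cluster.
## Assumes observations in rows and the 500 variables in columns.
dat <- read.csv("data.csv")

## First step: "look at" the data, e.g. the fraction missing per variable
miss.frac <- colMeans(is.na(dat))
hist(miss.frac, main = "Fraction missing per variable")

## Crude imputation: replace each NA with its column median
for (j in seq_along(dat)) {
  dat[[j]][is.na(dat[[j]])] <- median(dat[[j]], na.rm = TRUE)
}

## Reduce the 500 variables via principal components,
## then group the observations with k-means on the leading components
pc  <- prcomp(scale(dat))
grp <- kmeans(pc$x[, 1:10], centers = 5)$cluster
table(grp)
```

With 100,000 rows you would likely work on a random subsample first,
and something more principled than median imputation is worth
considering once you have seen the pattern of missingness.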
Hope this is of help. Good luck.
S. F. Thomas
=================================================================
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
http://jse.stat.ncsu.edu/
=================================================================