RE: [R] Rd Files?

2003-12-03 Thread Philippe Grosjean
Wolski [EMAIL PROTECTED] wrote:
 I have seen the output and it does not matter to me anymore if prompt
 or package.skeleton works on any platform. I hope it wasn't too big a
 heresy.  If someone asked me what the weak points of R are, then the
 only one that pops up immediately is that the documentation for
 functions has to be stored in a separate file from the code.  I am a
 big R/S fan.  But it's a pity that comments above or below the function
 declaration are not recognized by the help system. Therefore prompt

Tony Rossini answered:
Doug Bates commented on the possibility of patching Doxygen to do
this, once long ago.  Not sure if anyone took it anywhere, though.
It's a reasonable system for assisting with documentation in a number
of languages, though it could be improved.  That would be nice, since
then one would get C docs for free.

Another alternative is to write a noweb lit-prog file, and then generate
your package via noweb (NOT Sweave, though you get double duty, since
if it's written right, you can stick the original doc in as a
vignette).

Well, writing quick and dirty help for a function with a few lines of
comment above or below the function code (a la Matlab) would be nice. I
don't think it would be a good idea to provide a complex alternative to
the current mechanism for documenting functions, which is both powerful
and efficient (but, of course, a little bit complex). Here is a quick
and dirty implementation of a mechanism to include quick and dirty help
messages inside the code of an R function. I guess this is enough.

qhelp <- function(topic) {
    if (is.character(topic))
        topic <- get(topic)
    if (!is.function(topic))
        stop("`topic` must be a function, or the name of a function")
    fcode <- sub("    ", "", deparse(topic)) # Because 4 spaces are added by deparse
    # Look for quick help text, that is, strings starting with `#`
    qhlp <- fcode[grep("^\\\"#", fcode)]
    qhlp <- as.character(parse(text = qhlp))
    cat(paste(qhlp, "\n", sep = ""), sep = "")
    return(invisible())

    "# Quick help"
    "# `qhelp()` provides a mechanism to include \"quick help\""
    "# embedded inside the code of an R function."
    "#"
    "# Just end the function code with return(res)"
    "# and add some strings starting with `#` after it"
    "# with the content of your quick help message..."
}

# An example of a very simple function with quick help
cube <- function(x) {
    # This is some comment that will appear only when I print the function...
    return(x^3)

    "# Quick help"
    "# `cube(x)` returns the cube of its `x` argument"
    "# Version 0.1, by Ph. Grosjean ([EMAIL PROTECTED])"
}

qhelp(cube)    # Should return quick help
qhelp("qhelp") # Strings also allowed for `topic` argument
qhelp(log)     # No quick help, should print just an empty line
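
For the record, the first call should print something like this (the string
constants survive deparse() and are echoed by cat()):

> qhelp(cube)
# Quick help
# `cube(x)` returns the cube of its `x` argument
# Version 0.1, by Ph. Grosjean ([EMAIL PROTECTED])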

Best,

Philippe

 ) ) ) ) )
( ( ( ( (   Prof. Philippe Grosjean
 ) ) ) ) )
( ( ( ( (   Numerical Ecology Laboratory
 ) ) ) ) )  Mons-Hainaut University
( ( ( ( (   8, Av. du Champ de Mars, 7000 Mons, Belgium
 ) ) ) ) )
( ( ( ( (   phone: 00-32-65.37.34.97
 ) ) ) ) )  email: [EMAIL PROTECTED]; [EMAIL PROTECTED]
( ( ( ( (   SciViews project coordinator (http://www.sciviews.org)
 ) ) ) ) )
...

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] Rd Files?

2003-12-03 Thread A.J. Rossini
Philippe Grosjean [EMAIL PROTECTED] writes:

 Well, writing quick and dirty help for a function with a few lines of
 comment above or below the function code (a la Matlab) would be nice. I
 don't think it would be a good idea to provide a complex alternative to
 the current mechanism for documenting functions, which is both powerful
 and efficient (but, of course, a little bit complex). Here is a quick
 and dirty implementation of a mechanism to include quick and dirty help
 messages inside the code of an R function. I guess this is enough.

That's a nice quick and dirty solution.  It works in simple cases, but
fails to work in all cases.  But a 90-percent solution is probably
enough for the task at hand, especially for software limited to
individual deployment.

However, note that it's the basic idea behind the Doxygen framework, 
which does a more robust job of parsing and documenting.

best,
-tony


-- 
[EMAIL PROTECTED]http://www.analytics.washington.edu/ 
Biomedical and Health Informatics   University of Washington
Biostatistics, SCHARP/HVTN  Fred Hutchinson Cancer Research Center
UW (Tu/Th/F): 206-616-7630 FAX=206-543-3461 | Voicemail is unreliable
FHCRC  (M/W): 206-667-7025 FAX=206-667-4812 | use Email


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] Simulating correlated distributions

2003-12-03 Thread Coomaren Vencatasawmy
Hi
 
How can one simulate correlated distributions in R for Windows?
 
Coomaren P. Vencatasawmy
 



__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] Re: question regarding variance components

2003-12-03 Thread Federico Calboli
Assuming you are measuring Y and you have factor A fixed and factor B
random, I would create a model like:

mod <- lme(Y ~ A, random = ~1|B/A, mydata)
VarCorr(mod)

The term random=~1|B tells the model that B is a random factor; adding
the /A to get random=~1|B/A tells the model you want the
interaction between the fixed and random factors.

VarCorr gives you the variance components of the model.

All is answered much better (and with examples) in Pinheiro and Bates
2000 (it's in the first chapter) and in Crawley 2002.

I posted a question similar to yours some time ago, and got an
excellent reply from Prof. Bates; search the archives for it.


If ALL your factors are random, try something like:

mod <- lme(Y ~ 1, random = ~1|A/B, mydata)
VarCorr(mod)

but here I am more guessing than anything. Get Pinheiro and Bates 2000
for this.


Cheers,

Federico


-- 



=

Federico C. F. Calboli

PLEASE NOTE NEW ADDRESS

Dipartimento di Biologia
Via Selmi 3
40126 Bologna
Italy

tel (+39) 051 209 4187
fax (+39) 051 251 208

f.calboli at ucl.ac.uk

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] amap : hclust agglomeration

2003-12-03 Thread Finnie, Thomas
Hi,
I'm trying to understand the complete linkage method in hclust. Can anyone provide a
breakdown of the formula (p. 9 of the PDF documentation) or tell me what the sup
operator does/means?

thanks in advance

Tom


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] lme: reproducing example

2003-12-03 Thread Karl Knoblick
Thanks!
I think the minor differences come from taking the values with
rnorm: they result from a homogeneous distribution without an
effect. But the results of aov and lme should be
similar for data with effects, too (at least for
simple and balanced designs).

Karl

 --- Pascal A. Niklaus [EMAIL PROTECTED]
wrote:  Karl Knoblick wrote:
 
 Dear R-community!
 
 I still have the problem reproducing the following
 example using lme.
 
 id <- factor(rep(rep(1:5, rep(3,5)), 3))
 factA <- factor(rep(c("a1","a2","a3"), rep(15,3)))
 factB <- factor(rep(c("B1","B2","B3"), 15))
 Y <- numeric(length=45)
 Y[ 1: 9] <- c(56,52,48,57,54,46,55,51,51)
 Y[10:18] <- c(58,51,50,54,53,46,54,50,49)
 Y[19:27] <- c(53,49,48,56,48,52,52,52,50)
 Y[28:36] <- c(55,51,46,57,49,50,55,51,47)
 Y[37:45] <- c(56,48,51,58,50,48,58,46,52)
 df <- data.frame(id, factA, factB, Y)
 df.aov <- aov(Y ~ factA*factB + Error(factA:id),
 data=df)
 summary(df.aov)
 
 Is there a way to get the same results with lme as 
 with aov with Error()? HOW???
 
 One idea was the following:

 df$factAid <- factor(paste(as.character(df$factA), ":",
   as.character(df$id), sep=""))
 df.lme <- lme(Y ~ factA*factB, df, random = ~1|factAid, method="REML")
 
 The degrees of freedom look right, but the F values
 don't match aov.
 
 Hope somebody can help! Thanks!!
 
 Karl
   
 
 Hmmm, strange, it works if I use factB:id as plot...
 it also works when 
 I use factA:id as plot and replace your Y's by
 random numbers... is this 
 a problem with convergence?
 
 Pascal
 
 
 > df$Y = rnorm(45)
 > summary(aov(Y ~ factB*factA + Error(id:factA), data=df))

 Error: id:factA
           Df  Sum Sq Mean Sq F value Pr(>F)
 factA      2  2.9398  1.4699  0.9014 0.4318
 Residuals 12 19.5675  1.6306

 Error: Within
             Df  Sum Sq Mean Sq F value   Pr(>F)
 factB        2  7.1431  3.5716  7.4964 0.002956 **
 factB:factA  4  4.2411  1.0603  2.2254 0.096377 .
 Residuals   24 11.4345  0.4764
 ---
 Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
 
 > anova(lme(Y ~ factB*factA, data=df, random = ~ 1 | plot))
             numDF denDF  F-value p-value
 (Intercept)     1    24 0.014294  0.9058
 factB           2    24 7.496097  0.0030
 factA           2    12 0.901489  0.4318
 factB:factA     4    24 2.225317  0.0964
 
 Pascal
 
 
 > summary(aov(Y ~ factA*factB + Error(factB:id)))

 Error: factB:id
           Df Sum Sq Mean Sq F value    Pr(>F)
 factB      2 370.71  185.36  51.488 1.293e-06 ***
 Residuals 12  43.20    3.60
 ---
 Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

 Error: Within
             Df Sum Sq Mean Sq F value  Pr(>F)
 factA        2  9.911   4.956  1.6248 0.21788
 factA:factB  4 45.556  11.389  3.7341 0.01686 *
 Residuals   24 73.200   3.050

 > df$plot <- factor(paste(df$factB, df$id))
 > anova(lme(Y ~ factB*factA, data=df, random = ~1 | plot))
             numDF denDF  F-value p-value
 (Intercept)     1    24 33296.02  <.0001
 factB           2    12    51.47  <.0001
 factA           2    24     1.63  0.2178
 factB:factA     4    24     3.73  0.0168


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] Simulating correlated distributions

2003-12-03 Thread Duncan Murdoch
On Wed, 3 Dec 2003 10:08:04 +0000 (GMT), you wrote:

Hi
 
How can one simulate correlated distributions in R for Windows?

I'm not sure exactly what you're asking, but maybe the MASS function
mvrnorm() is what you want.
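
For instance, a minimal sketch with made-up parameters (mvrnorm() is in the
MASS package):

library(MASS)
Sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)  # unit variances, correlation 0.7
x <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma)
cor(x)  # sample correlation should come out near 0.7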

Duncan Murdoch

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] Changing Colors

2003-12-03 Thread Lars Peters
Hello,

I've got a big problem. I'm using R for geostatistical analyses, especially
the fields package.
I try to generate plots after the kriging process with the help of
image.plot(..., col=terrain.colors, ...). Everything works fine, but I want
to reverse the color palettes (heat.colors, topo.colors or gray()) to get
the darkest colors at the highest data values instead of the other way round.

Could anyone give me hints or some syntax to resolve that problem?


Thanks and best regards,

Lars Peters


-
Lars Peters

University of Konstanz
Limnological Institute
D-78457 Konstanz
Germany

phone: +49 (0)7531 88-2930
fax:   +49 (0)7531 88-3533
e-mail: [EMAIL PROTECTED]
http://www.uni-konstanz.de/sfb454/tp_eng/A1/doc/peters/peters.html
http://www.uni-konstanz.de/sfb454/tp_eng/A1/index.htm

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] Changing Colors

2003-12-03 Thread Roger Bivand
On Wed, 3 Dec 2003, Lars Peters wrote:

 Hello,
 
 I've got a big problem. I'm using R for geostatistical analyses, especially
 the fields package.
 I try to generate plots after the kriging process with the help of
 image.plot(..., col=terrain.colors, ...). Everything works fine, but I want
 to reverse the color palettes (heat.colors, topo.colors or gray()) to get
 the darkest colors at the highest data values instead of the other way round.
 
 Could anyone give me hints or some syntax to resolve that problem?

rev()?
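
That is, reverse the palette vector before passing it. A minimal sketch
(made-up data; image.plot() is from the fields package, and 64 colour levels
is arbitrary):

library(fields)
z <- matrix(rnorm(100), 10, 10)               # made-up kriging surface
image.plot(z, col = rev(terrain.colors(64)))  # darkest colours at the highest values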

 
 
 Thanks and best regards,
 
 Lars Peters
 
 
 -
 Lars Peters
 
 University of Konstanz
 Limnological Institute
 D-78457 Konstanz
 Germany
 
 phone: +49 (0)7531 88-2930
 fax:   +49 (0)7531 88-3533
 e-mail: [EMAIL PROTECTED]
 http://www.uni-konstanz.de/sfb454/tp_eng/A1/doc/peters/peters.html
 http://www.uni-konstanz.de/sfb454/tp_eng/A1/index.htm
 
 __
 [EMAIL PROTECTED] mailing list
 https://www.stat.math.ethz.ch/mailman/listinfo/r-help
 

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Breiviksveien 40, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93
e-mail: [EMAIL PROTECTED]

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] Vector Assignments

2003-12-03 Thread Arend P. van der Veen
Your recommendations have worked great.  I have found both cut and
ifelse to be useful.

I have one more question: when should I use factors rather than a character
vector?  I know that they have different uses.  However, I am still
trying to figure out how I can best take advantage of factors.

The following is what I am really trying to do:

colors <- c("red", "blue", "green", "black")
y.col <- colors[cut(y, c(-Inf,250,500,700,Inf), right=F, lab=F)]
plot(x, y, col=y.col)

Would using factors make this any cleaner?  I think a character vector
is all I need but I thought I would ask.

Thanks for your help,
Arend van der Veen



On Tue, 2003-12-02 at 00:32, Gabor Grothendieck wrote:
 And one other thing.  Are you sure you want character variables
 as the result of all this?  A column whose entries are each one
 of four colors seems like a good job for a factor:
 
 colours <- c("red", "blue", "green", "black")
 cut(x, c(-Inf,250,500,700,Inf), right=F, lab=colours)
 
 
 
 ---
 Date: Mon, 1 Dec 2003 23:47:39 -0500 (EST) 
 From: Gabor Grothendieck [EMAIL PROTECTED]
 To: [EMAIL PROTECTED], [EMAIL PROTECTED] 
 Cc: [EMAIL PROTECTED] 
 Subject: Re: [R] Vector Assignments 
 
  
 
 
 
 Just some small refinements/corrections:
 
 colours <- c("red", "blue", "green", "black")
 colours[cut(x, c(-Inf,250,500,700,Inf), right=F, lab=F)]
 
 ---
 Date: Tue, 02 Dec 2003 14:38:55 +1300 
 From: Hadley Wickham [EMAIL PROTECTED]
 To: Arend P. van der Veen [EMAIL PROTECTED] 
 Cc: R HELP [EMAIL PROTECTED] 
 Subject: Re: [R] Vector Assignments 
 
 
 
 One way would be to create a vector of colours and then cut() to index 
 the vector:
 
 colours <- c("red", "blue", "green", "black")
 colours[cut(x, c(min(x),250,500,700,max(x)), lab=F)]
 
 Hadley
 
 
 Arend P. van der Veen wrote:
 
 Hi,
 
 I have a simple R question.
 
 I have a vector x that contains real numbers. I would like to create
 another vector col that is the same length as x such that:
 
 if x[i] < 250 then col[i] = "red"
 else if x[i] < 500 then col[i] = "blue"
 else if x[i] < 750 then col[i] = "green"
 else col[i] = "black" for all i
 
 I am convinced that there is probably a very efficient way to do this in
 R but I am not able to figure it out. Any help would be greatly
 appreciated.
 
 Thanks in advance,
 Arend van der Veen
 
 
 


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] multidimensional Fisher or Chi square test

2003-12-03 Thread Arne.Muller
Hello,

Is there a test for independence available based on a multidimensional
contingency table?

I've about 300 processes, and for each of them I get numbers for failures and
successes. I've two or more conditions under which I test these processes.

If I had just one process to test I could just perform a Fisher or chi-square
test on a 2x2 contingency table, like this:

for one process:
conditionA  conditionB
ok  20  6
failed  190 156

From the table I can figure out if the outcome (ok/failed) is bound to one of
the conditions for a process. However, I'd like to know how different the 2
conditions are from each other considering all 300 processes, and I consider
the processes to be an additional dimension. 

My H0 is that both conditions are overall (considering all processes) the
same.

Could you give me a hint what kind of test or package I should look into?

kind regards + thanks for your help,

Arne

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] intraclass correlation

2003-12-03 Thread Veronique Verhoeven
Hi,


Can R calculate an intraclass correlation coefficient for clustered data,
when the outcome variable is dichotomous?
By now I calculate it by hand, estimating between- and intra-cluster variance
by one-way ANOVA - however, I don't feel very comfortable about this, since
the distributional assumptions are not really met.
Maybe someone can help me?

Best regards and many many thanks,

Veronique Verhoeven
University of Antwerp

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] reason for Factors -- was -- Vector Assignments

2003-12-03 Thread Thomas W Blackwell

On Wed, 3 Dec 2003, Arend P. van der Veen wrote:

 Your recommendations have worked great.  I have found both cut and
 ifelse to be useful.

 I have one more question: when should I use factors rather than a character
 vector?  I know that they have different uses.  However, I am still
 trying to figure out how I can best take advantage of factors.

 The following is what I am really trying to do:

 colors <- c("red", "blue", "green", "black")
 y.col <- colors[cut(y, c(-Inf,250,500,700,Inf), right=F, lab=F)]
 plot(x, y, col=y.col)

 Would using factors make this any cleaner?  I think a character vector
 is all I need but I thought I would ask.

 Thanks for your help,
 Arend van der Veen

Arend  -

When setting the colors of plotted points, you definitely want
a vector of character strings as the color names.  Factor was
invented so that regression and analysis of variance functions
would properly recognize a grouping variable and not fit simply
a linear coefficient to the integer codes.  In the context of a
linear (or similar) model, each factor or interaction has to be
expanded from a single column of integer codes into a matrix of
[0,1] indicator variables, with a separate column for each possible
level of the factor.  (I oversimplify a bit here: some columns
are omitted, to keep the design matrix from being over-specified.)
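
A small illustration of that expansion (a minimal sketch):

f <- factor(c("red", "blue", "green", "red"))
model.matrix(~ f)  # one 0/1 indicator column per non-baseline level, plus intercept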

-  tom blackwell  -  u michigan medical school  -  ann arbor  -

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] non-uniqueness in cluster analysis

2003-12-03 Thread Bruno Giordano
Hi,
I'm clustering objects defined by categorical variables with a hierarchical
algorithm - average linkage.
My distance matrix (general dissimilarity coefficient) includes several
distances with exactly the same values.
As I see it, a standard agglomerative procedure ignores this problem, simply
selecting, among equal distances, the one that comes first.
For this reason the analysis output depends strongly on the ordering of
the objects within the raw data matrix.
Is there a standard procedure to deal with this?
Thanks
Bruno

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] multidimensional Fisher or Chi square test

2003-12-03 Thread Dennis Alexis Valin Dittrich
On Wed, 2003-12-03 at 14:34, [EMAIL PROTECTED] wrote:
 Is there a test for independence available based on a multidimensional
 contingency table?
 I've about 300 processes, and for each of them I get numbers for failures and
 successes. I've two or more conditions under which I test these processes.

You may look for ?mantelhaen.test 
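
For instance, a minimal sketch with fabricated counts: a 2 x 2 x K table of
outcome by condition by process, tested with the Cochran-Mantel-Haenszel
statistic:

K <- 300                                    # number of processes (strata)
tab <- array(rpois(2 * 2 * K, lambda = 50), dim = c(2, 2, K),
             dimnames = list(c("ok", "failed"),
                             c("conditionA", "conditionB"), NULL))
mantelhaen.test(tab)  # H0: conditional independence of outcome and condition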

Dennis

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] non-uniqueness in cluster analysis

2003-12-03 Thread Thomas W Blackwell
Bruno  -

Many people add a tiny random number to each of the distances,
or deliberately randomize the input order.  This means that
any clustering is not reproducible, unless you go back to the
original randoms, but it forces you not to pay attention to
minor differences.

Ah, I think you're asking about bootstrap confidence intervals
for the set of descendants from each interior vertex.  This is
certainly routine procedure when inferring evolutionary trees,
but I'm not sure any of that code has been re-implemented in R
or Splus.

-  tom blackwell  -  u michigan medical school  -  ann arbor  -

On Wed, 3 Dec 2003, Bruno Giordano wrote:

 Hi,
 I'm clustering objects defined by categorical variables with a hierarchical
 algorithm - average linkage.
 My distance matrix (general dissimilarity coefficient) includes several
 distances with exactly the same values.
 As I see it, a standard agglomerative procedure ignores this problem, simply
 selecting, among equal distances, the one that comes first.
 For this reason the analysis output depends strongly on the ordering of
 the objects within the raw data matrix.
 Is there a standard procedure to deal with this?
 Thanks
 Bruno

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] non-uniqueness in cluster analysis

2003-12-03 Thread Prof Brian Ripley
On Wed, 3 Dec 2003, Bruno Giordano wrote:

 Hi,
 I'm clustering objects defined by categorical variables with a hierarchical
 algorithm - average linkage.
 My distance matrix (general dissimilarity coefficient) includes several
 distances with exactly the same values.
 As I see it, a standard agglomerative procedure ignores this problem, simply
 selecting, among equal distances, the one that comes first.
 For this reason the analysis output depends strongly on the ordering of
 the objects within the raw data matrix.
 Is there a standard procedure to deal with this?

Don't use average linkage!

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] non-uniqueness in cluster analysis

2003-12-03 Thread Christian Hennig
Hi,

Brian Ripley already replied "don't use average linkage"... You
may think about k-medoids (pam) in package cluster instead.
However, often average linkage is not such a bad choice, and if you really
want to use it for your data, you may try the following:
Among the hierarchical methods, single linkage has the smallest problem
with equal distances, because possible agglomerations based on equal
distances between clusters are all carried out regardless of the order.
If at some step the smallest between-cluster distance
is d(a,b) = d(a,c) < d(b,c), it may happen that a and b are merged first, or
a and c are merged first, but before merging anything else with distance
larger than d(a,b), a, b *and* c are merged. Thus, you have order
dependence only between the steps where you merge clusters with the same
distance, but not afterwards.

If your problem occurs only at a low level of agglomeration
(and you don't have
situations where d(a,b) and d(a,c) are small and d(b,c) is very large; I
do not know if the triangle inequality holds for your data), you may do
some first steps with single linkage and then continue with average
linkage (I haven't thought about whether this can be done in R without
extra effort).

But if you have already observed that the average linkage outcome depends
critically (from the viewpoint of interpretation) on the order of points,
then it seems that you are in an unstable situation, whether or not you
are able to define a unique clustering.

Christian

On Wed, 3 Dec 2003, Bruno Giordano wrote:

 Hi,
 I'm clustering objects defined by categorical variables with a hierarchical
 algorithm - average linkage.
 My distance matrix (general dissimilarity coefficient) includes several
 distances with exactly the same values.
 As I see it, a standard agglomerative procedure ignores this problem, simply
 selecting, among equal distances, the one that comes first.
 For this reason the analysis output depends strongly on the ordering of
 the objects within the raw data matrix.
 Is there a standard procedure to deal with this?
 Thanks
 Bruno
 
 __
 [EMAIL PROTECTED] mailing list
 https://www.stat.math.ethz.ch/mailman/listinfo/r-help
 

***
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
[EMAIL PROTECTED], http://www.math.uni-hamburg.de/home/hennig/
###
ich empfehle www.boag-online.de

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] non-uniqueness in cluster analysis

2003-12-03 Thread Bruno Giordano
What I did was, in the presence of equal-valued distances, to randomize the
selection among them, and compute the distortion of the solution using
cophenetic correlation.
I computed 1 random trees for each of three methods: average, single
and complete linkage.
Among the randomly selected solutions, for the three methods, average
linkage gave the highest cophenetic correlation, followed by
complete and then by single linkage. Among the random trees, single
linkage, for obvious reasons, gave a constant cophenetic correlation.
My data set is rather small (25 objects). I'm seriously thinking of
calculating all the possible solutions (I guess about 3), picking the
ones that give the highest cophenetic correlation, and analyzing the
consistency among those solutions, after establishing a natural number of
clusters.

Bruno

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] setMethod(min, myclass, ...)

2003-12-03 Thread John Chambers
Thomas Stabla wrote:
 
 Hello,
 
 I have defined a new class
 
  setClass("myclass", representation(min = "numeric", max = "numeric"))
 
 and want to write accessor functions, so that for
 
  foo = new("myclass", min = 0, max = 1)
  min(foo) # prints 0
  max(foo) # prints 1
 
 At first i created a generic function for min
 
  setGeneric("min", function(..., na.rm = FALSE) standardGeneric("min"))
 
 and then tried to use setMethod. And there's my problem, I don't know the
 name of the first argument which is to be passed to min.
 I can't just write:
 
  setMethod("min", "myclass", function(..., na.rm = FALSE) x@min)
 
 The same problem occurs with max.

Generally, it's not a good idea to take a well-known function name and
make it into an accessor function.

In a functional language, basic function calls such as min(x), sin(x),
etc. should have a natural interpretation.  Defining methods for these
functions is meant to do what it says:  provide a method to achieve the
general purpose of the function for particular objects.

If you want accessor functions, they should probably have names that
make their purpose obvious.  One convention, a la Java properties, would
be getMin(x), etc. (the capitalizing is potentially an issue since slot
names are case sensitive).

If you do really want a method for the min() function for myclass,
that's a different problem.

The argument "..." is different from all other argument names, and it
can't be used in a signature.  To define methods for functions such as
min(), the formal arguments of the basic function would have to be
changed to, say, function(x, ..., na.rm)

Then methods can be defined for argument x.
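
For instance, a minimal sketch (it masks the base function, and other classes
would still need their own methods):

setGeneric("min", function(x, ..., na.rm = FALSE) standardGeneric("min"))
setMethod("min", "myclass", function(x, ..., na.rm = FALSE) x@min)
min(new("myclass", min = 0, max = 1))  # dispatches on x and returns 0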

The R versions of these functions don't currently allow methods.  If
methods are needed, a package could currently define its own version of
the (non-generic) functions to include argument x.

More than this is needed to handle multiple arguments generally.  The
problem is that defining a method for argument x does not cause that
method to be called if the object appears as a later argument.

If you had a method for min() for myclass and myX was an object from
that class
  min(myX, 1)
would work, but
  min(1, myX)
would not.  To get all examples right would require changes to the basic
code for these functions.  There is a brief discussion of one approach
in Programming with Data, pages 343 and 351.






 
 Thanks for your help.
 
 Greetings,
 Thomas Stabla
 
 __
 [EMAIL PROTECTED] mailing list
 https://www.stat.math.ethz.ch/mailman/listinfo/r-help

-- 
John M. Chambers  [EMAIL PROTECTED]
Bell Labs, Lucent Technologiesoffice: (908)582-2681
700 Mountain Avenue, Room 2C-282  fax:(908)582-3340
Murray Hill, NJ  07974web: http://www.cs.bell-labs.com/~jmc

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] intraclass correlation

2003-12-03 Thread Andrew Perrin
I have been using a little function I wrote myself; look at
http://www.unc.edu/home/aperrin/tips/src/icc.R for the code.  Not pretty,
but it works.
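
For comparison, the usual one-way ANOVA estimator is roughly the following
(a sketch, not the code at the URL above; it assumes balanced clusters of
size k):

icc.anova <- function(y, cluster, k) {
  tab <- anova(aov(y ~ factor(cluster)))
  msb <- tab["factor(cluster)", "Mean Sq"]  # between-cluster mean square
  msw <- tab["Residuals", "Mean Sq"]        # within-cluster mean square
  (msb - msw) / (msb + (k - 1) * msw)
}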

ap

--
Andrew J Perrin - http://www.unc.edu/~aperrin
Assistant Professor of Sociology, U of North Carolina, Chapel Hill
[EMAIL PROTECTED] * andrew_perrin (at) unc.edu


On Wed, 3 Dec 2003, Veronique Verhoeven wrote:

 Hi,


 Can R calculate an intraclass correlation coefficient for clustered data,
 when the outcome variable is dichotomous?
 By now I calculate it by hand, estimating between- and intracluster variance
 by one-way ANOVA - however I don't feel very comfortable about this, since
 the distributional assumptions are not really met
 Maybe anyone can help me?

 Best regards and many many thanks,

 Veronique Verhoeven
 University of Antwerp

 __
 [EMAIL PROTECTED] mailing list
 https://www.stat.math.ethz.ch/mailman/listinfo/r-help


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


RE: [R] Error in randomForest.default(m, y, ...) : negative length vectors are not allowed

2003-12-03 Thread Wiener, Matthew
Christian -- 

You don't provide enough information (like a call) to answer this.  I
suspect, though, that you may be subsetting in a way that passes
randomForest no data.

I'm not aware offhand of an easy way to get this error from randomForest.  I
tried creating some data superficially similar to yours to see whether
something would break if there were only a single value in the variable to
be explained, but everything worked fine (though it does give a reasonable
warning).

> test.dat <- data.frame(a = rep(0, 1000), b = runif(1000), c = sample(0:1,
1000, replace = TRUE, p = c(.8, .2)))
> t8 <- randomForest(a ~ b + c, data = test.dat)
Warning message: 
The response has five or fewer unique values.  Are you sure you want to do
regression? in: randomForest.default(m, y, ...) 
> test.dat[sample(1:1000, 100), "a"] <- runif(100, 1, 200)
> t8 <- randomForest(a ~ b + c, data = test.dat)

Some other generated data might come up with the error, but I'd bet on the
subsetting problem.

Hope this helps,  -Matt

Matthew Wiener
RY84-202
Applied Computer Science & Mathematics Dept.
Merck Research Labs
126 E. Lincoln Ave.
Rahway, NJ 07065
732-594-5303 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Christian Schulz
Sent: Wednesday, December 03, 2003 9:42 AM
To: [EMAIL PROTECTED]
Subject: [R] Error in randomForest.default(m, y, ...) : negative length
vectors are not allowed


Hi,

What am I doing wrong?
I'm using a data.frame with ~90,000 instances
and 7 attributes; 5 are binary recoded,
1 independent variable is a real one,
and the target is a real one, too.

The distributions are not very skewed in the dummy variables,
but the real variable has ~60,000
zero-valued instances; zero means
no money is paid, and that is an important value!

Many thanks for help & suggestions,
regards, Christian

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


RE: [R] HMisc describe -- error with dates

2003-12-03 Thread Tanya Murphy
Thank you Frank and Gabor for the fixes and checking and rechecking! 
Everything seems to work well with the Hmisc functions tried--upData, describe 
and summary.

To summarize:
1. Add the testDateTime and formatDateTime functions (copied from Frank's 
messages) to the Hmisc file (or run prior to loading Hmisc)


testDateTime <- function(x, what=c('either','both','timeVaries')) {
  what <- match.arg(what)
  cl <- class(x) # was oldClass 22jun03
  if(!length(cl)) return(FALSE)

  dc <- if(.R.) c('POSIXt','POSIXct','dates','times','chron') else
  c('timeDate','date','dates','times','chron')
  dtc <- if(.R.) c('POSIXt','POSIXct','chron') else
  c('timeDate','chron')
  switch(what,
  either = any(cl %in% dc),
  both = any(cl %in% dtc),
  timeVaries = {
  if('chron' %in% cl || !.R.) { ## chron or S+ timeDate
  y <- as.numeric(x)
  length(unique(round(y - floor(y), 13))) > 1
  } else if(.R.) length(unique(format(x,'%H%M%S'))) > 1 else
  FALSE
  })
  }

formatDateTime <- function(x, at, roundDay=FALSE) {
cl <- at$class
w <- if(any(cl %in% c('chron','dates','times'))) {
attributes(x) <- at
fmt <- at$format
if(roundDay) {
if(length(fmt)==2 && is.character(fmt))
format.dates(x, fmt[1]) else format.dates(x)
} else x
} else if(.R.) {
attributes(x) <- at
if(roundDay) as.POSIXct(round(x, 'days')) else x
} else timeDate(julian=if(roundDay)round(x) else x)
format(w)
}

2. Replace the describe function with the new one (available as an attachment
in Frank's most recent message on the subject). Instead of editing the
original Hmisc file, this could be run after the Hmisc library is loaded.

Right?


Tanya

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


AW: [R] Error in randomForest.default(m, y, ...) : negative length vectors are not allowed

2003-12-03 Thread Christian Schulz
Hmmm, thanks for your suggestions. I am of the
same opinion that it is some subsetting problem, but what is curious is
that my model works with e.g. library(gbm) or a simple lm.
My task is to find out the weights/importance values
for the attributes, and I would like to compare the results between
the randomForest classifier and a linear approach.

I will check it against your suggestions and code snippets in detail
and report back the problem, if I find the solution.

regards, Christian




__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] checking for identical columns in a mxn matrix

2003-12-03 Thread Rajarshi Guha
Hi,
  I have a rectangular matrix and I need to check whether any columns
are identical or not. Currently I'm looping over the columns and
checking each column against all the others with identical().

However, as experience has shown me, getting rid of loops is a good idea
:) Would anybody have any suggestions as to how I could do this job more
efficiently?

(It would be nice to know which columns are identical, but that's not a
necessity.)
necessity.)

---
Rajarshi Guha [EMAIL PROTECTED] http://jijo.cjb.net
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
Entropy isn't what it used to be.

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] checking for identical columns in a mxn matrix

2003-12-03 Thread Marc Schwartz
On Wed, 2003-12-03 at 12:06, Rajarshi Guha wrote:
 Hi,
   I have a rectangular matrix and I need to check whether any columns
 are identical or not. Currently I'm looping over the columns and
 checking each column with all the others with identical().
 
 However, as experience has shown me, getting rid of loops is a good idea
 :) Would anybody have any suggestions as to how I could do this job more
 efficiently.
 
  (It would be nice to know which columns are identical, but that's not a
  necessity.)


If your matrix is 'x' and contains text and/or integer values (since
float comparisons can be problematic) you can use:

any(duplicated(x, MARGIN = 2))

to find out if any of the columns are duplicated and  

which(duplicated(x, MARGIN = 2))

to get the column numbers that are duplicates in the matrix.

If you want to extract the unique columns, you can use:

unique(x, MARGIN = 2)

See ?duplicated and ?unique for more information.

Example:

> x <- matrix(c(1:3, 4:6, 1:3, 7:9), ncol = 4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    1    7
[2,]    2    5    2    8
[3,]    3    6    3    9

> any(duplicated(x, MARGIN = 2))
[1] TRUE

> which(duplicated(x, MARGIN = 2))
[1] 3

> unique(x, MARGIN = 2)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

HTH,

Marc Schwartz

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] nameless functions in R

2003-12-03 Thread Thomas Lumley
On Wed, 3 Dec 2003, Rajarshi Guha wrote:

 Hi,
   I have an apply statement that looks like:

 > check.cols <- function(v1, v2) {
 + return( identical(v1,v2) );
 + }
 > x
      [,1] [,2] [,3]
 [1,]    1    3    3
 [2,]    4    5    4
 [3,]    2    7    6
 > apply(x, c(2), check.cols, v2=c(7,8,9))
 [1] FALSE FALSE FALSE

 Is it possible to make the function check.cols() inline to the apply
 statement. Some thing like this:

 apply(x, c(2), funtion(v1,v2){ identical(v1,v2) }, v2=c(1,4,2))

 The above gives me a syntax error. I also tried:

 apply(x, c(2), fun <- funtion(v1,v2){ return(identical(v1,v2)) },
 v2=c(1,4,2))

 and I still get a syntax error.

 Is this type of syntax allowed in R?


Yes, anonymous functions are allowed. Anonymous funtions aren't -- you
appear to have a typographical problem.
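
With the spelling fixed, the inline form does what was intended; for example,
comparing against the second column of x so the types match:

apply(x, 2, function(v1, v2) identical(v1, v2), v2 = x[, 2])
# [1] FALSE  TRUE FALSE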

-thomas

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] model of fish over exploitation

2003-12-03 Thread John Sibert
It looks like you are trying to fit the Schaefer model (a special case of the
Pella-Tomlinson general production model) to the data. Such models can be
solved in a completely general way using AD Model Builder, and an example of
the general production model application can be found at
http://otter-rsch.com/examples.htm#docs

Bon courage,
John


John Sibert, Manager
Pelagic Fisheries Research Program
University of Hawaii at Manoa
1000 Pope Road, MSB 313
Honolulu, HI 96822
United States
Phone: (808) 956-4109
Fax: (808) 956-4104

Washington DC
Phone: (202) 861 2363
Fax: (202) 861 4767

PFRP Web Site:   http://www.soest.hawaii.edu/PFRP/
email:  [EMAIL PROTECTED]
__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


RE: [R] checking for identical columns in a mxn matrix

2003-12-03 Thread Liaw, Andy
 From: Rajarshi Guha

 On Wed, 2003-12-03 at 13:18, J.R. Lockwood wrote:
 
  list will come up with something clever.  the other issue is that you
  need to be careful when doing equality comparisons with floating point
  numbers.  unless your matrix consists of characters or integers,
  you'll need to think about some level of numerical tolerance of your
  comparison.
 
 Yes, the matrix will always be integer.

Other than what J.R. and Marc suggested, you could try to use

dist(t(x), method = "manhattan")

and see which entries are 0 (or close enough to 0).
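
A sketch of that check on a small integer matrix (column 3 repeats column 1):

x <- matrix(c(1:3, 4:6, 1:3), ncol = 3)
d <- as.matrix(dist(t(x), method = "manhattan"))
which(d == 0 & upper.tri(d), arr.ind = TRUE)  # index pairs of identical columns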

HTH,
Andy


 
 ---
 Rajarshi Guha [EMAIL PROTECTED] http://jijo.cjb.net
 GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
 ---
 All theoretical chemistry is really physics; and all theoretical
 chemists 
 know it.
 -- Richard P. Feynman
 
 __
 [EMAIL PROTECTED] mailing list
 https://www.stat.math.ethz.ch/mailman/listinfo/r-help


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] nameless functions in R

2003-12-03 Thread Bjørn-Helge Mevik
Rajarshi Guha [EMAIL PROTECTED] writes:

 apply(x, c(2), funtion(v1,v2){ identical(v1,v2) }, v2=c(1,4,2))

 The above gives me a syntax error. I also tried:

No wonder!  Try with `function' instead of `funtion'.

-- 
Bjørn-Helge Mevik

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] volume of an irregular grid

2003-12-03 Thread Karim Elsawy
I have a 3d irregular grid of a surface (a closed surface).
I would like to calculate the volume enclosed inside this surface.
Can this be done in R?
Any help is very much appreciated.
Best regards,
Karim

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] add a point to regression line and cook's distance

2003-12-03 Thread Spencer Graves
 What is the context?  What do the outliers represent?  If you 
think carefully about the context, you may find the answer. 

 hope this helps.  spencer graves
p.s.  I know statisticians who worked for HP before the split and who 
still work for either HP or Agilent, I'm not certain which.  If you want 
to contact me off-line, I can give you a couple of names if that might 
help. 

[EMAIL PROTECTED] wrote:

Hi, 

This is more a statistics question than an R question, but I thought people on this list may have some pointers.

My question is the following:
I would like to have a robust regression line. The data I have are mostly clustered
around a small range, so the regression line tends to be influenced strongly by
outlier points (with large Cook's distance). From the application's background,
I know that the line should pass through (0,0), which is far away from the data
cloud. I would like to add this point to have a more robust line. The question is:
does it make sense to do this? What are the negative impacts, if any?
thanks,
jonathan
__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
 

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] predict.gl1ce question

2003-12-03 Thread Peter Dalgaard
Richard Bonneau [EMAIL PROTECTED] writes:

 Hi,
 
 I'm using gl1ce with family=binomial like so:
 > yy
succ fail
   [1,]   76   23
   [2,]   32   67
   [3,]   56   43
   ...
 [24,]   81   18
 
 > xx
   c1219   c643
 X1  0.04545455 0.64274145
 X2  0.17723669 0.90392792
 ...
 X24 0.80629054 0.12239320
 
 > test.gl1ce <- gl1ce(yy ~ xx, family = binomial(link=logit), bound =
 0.5 )
 or
 > omit <- c(2,3)
 > test.gl1ce <- gl1ce(yy[-omit,] ~ xx[-omit,], family =
 binomial(link=logit), bound = 1 )
 
 this seems to work fine, and as I change the shrinkage parameter
 everything behaves as expected.
 
 If I try to get the fitted values (y-hat) using predict I have no
 problems:
 > predict.gl1ce(test.gl1ce)
   [1] 0.38129977 0.16513661 0.47666779 0.45348757 0.09916513 0.18167674
   [7] 0.11047684 0.15786664 0.14765670 0.40657031 0.19072570 0.80259477
 [13] 0.36317090 0.35930557 0.23700520 0.17579282 0.18835043 0.52306049
 [19] 0.28388953 0.41262864 0.29933710 0.43556139 0.15276727 0.73017401
 
 ***
 I have problems, however, when I try to use predict.gl1ce() with
 newdata.
 
 so, the following tries all give errors:
 > predict.gl1ce(test.gl1ce, xx, family=binomial(link-logit))
 > predict.gl1ce(test.gl1ce, xx)
 > predict.gl1ce(test.gl1ce, xx[omit,], family=binomial(link-logit))
 Error in predict.l1ce(test.gl1ce, xx, family = binomial(link - logit)) :
  Argument `newdata' is not a data frame, and cannot be coerced
 to an appropriate model matrix
 
 the following weak try also bombs:
 > predict.gl1ce(test.gl1ce, data.frame(xx), family=binomial(link-logit))
 Error in eval(expr, envir, enclos) : attempt to apply non-function
 
 
 I've tried quite a few variations. It seems I'm missing something, but
 if glm or gl1ce take a certain
 data format then the corresponding predict methods should too (what am I
 missing?).

1. try link="logit", as opposed to what you typed
2. xx is probably not a data frame (or yy~xx would be unlikely to work), so
   as.data.frame might do the trick.
3. what is gl1ce? If you're having trouble with an add-on package, you
   might have the courtesy to tell us which one. Not everyone uses
   lasso2 on a daily basis, you know.

-- 
   O__   Peter Dalgaard Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics 2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark  Ph: (+45) 35327918
~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


RE: [R] add a point to regression line and cook's distance

2003-12-03 Thread Wiener, Matthew
If you know that the line should pass through (0,0), would it make sense to
do a regression without an intercept?  You can do that by putting -1 in
the formula, like:  lm(y ~ x - 1).
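
A quick sketch with made-up data, comparing the two fits:

x <- 1:20
y <- 2 * x + rnorm(20)
coef(lm(y ~ x - 1))  # slope only; line forced through the origin
coef(lm(y ~ x))      # intercept estimated freely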

Hope this helps,

Matt

Matthew Wiener
RY84-202
Applied Computer Science & Mathematics Dept.
Merck Research Labs
126 E. Lincoln Ave.
Rahway, NJ 07065
732-594-5303 




__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] add a point to regression line and cook's distance

2003-12-03 Thread Murray Jorgensen
Not a good idea, unless the regression function is *known* to be linear. 
More likely it is only approximately linear over small ranges.

Murray Jorgensen


--
Dr Murray Jorgensen  http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: [EMAIL PROTECTED]    Fax 7 838 4155
Phone +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


RE: [R] add a point to regression line and cook's distance

2003-12-03 Thread jonathan_li

It is likely that the true relationship is nonlinear. There isn't a priori knowledge
about linearity. In the small range where we do have enough data, the relationship
looks linear. Outside the range, the data are very scarce and have high levels of
noise, too.
This is why adding (0,0) to the data can potentially improve the fit a great deal. But at the
same time, I have never heard of people doing it this way.
Jonathan


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] RE: R performance questions

2003-12-03 Thread Michael Benjamin
Hi--

While I agree that we cannot agree on the ideal algorithms, we should be
taking practical steps to implement microarrays in the clinic.  I think
we can all agree that our algorithms have some degree of efficacy over
and above conventional diagnostic techniques.  If patients are dying
from lack of diagnostic accuracy, I think we have to work hard to use
this technology to help them, if we can.  I think we can, even now.

What if I offer, in my clinic, a service for cancer patients to compare
their affy data to an existing set of data, to predict their prognosis
or response to chemotherapy?  I think people will line up out the door
for such a service.  Knowing what we as a group of array analyzers know,
wouldn't we all want this kind of service available if we or a loved one
got cancer?

Can our programs deal with 1,000 .cel files?  10,000 files?  

I think our programs are pretty good, but what we need is DATA.  We must
be careful what we wish for--we might get it!  So how do we measure
whether analyzing 10,000 .cel files with library(affy) is feasible?  I'm
assuming that advanced hardware would be required for such a task.  What
are the critical components of such a platform?  How much money would a
feasible system for array analysis cost?

I was just looking ahead two or three years--where is all this genomic
array research headed?  I guess I'm concerned about scalability.  

Is anyone really working on implementing affy on a cluster/Beowulf?
That sounds like a real challenge.

Regards,
Michael Benjamin, MD
-Original Message-
From: Liaw, Andy [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 03, 2003 9:47 PM
To: 'Michael Benjamin'
Subject: RE: [BioC] R performance questions

Another point about benchmarking:  As has been discussed on R-help before,
benchmarks can be misleading, like the one you mentioned.  It measures linear
algebra tasks, etc., but those typically account for a very small portion of
average tasks.  Doug Bates also pointed out that the eigen() example used
in that benchmark is computing mostly meaningless results.

In our experience, learning to use R more efficiently gives us the most
mileage, but large and fast hardware wouldn't hurt...

Cheers,
Andy

 -Original Message-
 From: Michael Benjamin [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, December 03, 2003 7:32 PM
 To: 'Liaw, Andy'
 Subject: RE: [BioC] R performance questions
 
 
 Thanks.
 Mike
 
 -Original Message-
 From: Liaw, Andy [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, December 03, 2003 8:17 AM
 To: 'Michael Benjamin'
 Subject: RE: [BioC] R performance questions
 
 Hi Michael,
 
 Just one comment about SVM.  If you use the svm() function in the e1071
 package to train a linear SVM, it will be rather slow.  That's a known
 limitation of libsvm, which the svm() function uses.  If you are willing
 to go outside of R, the bsvm package by C.J. Lin (the same person who
 wrote libsvm) will train a linear SVM in a much more efficient manner.
 
 HTH,
 Andy
 
  -Original Message-
  From: [EMAIL PROTECTED] 
  [mailto:[EMAIL PROTECTED] On Behalf Of 
  Michael Benjamin
  Sent: Tuesday, December 02, 2003 10:30 PM
  To: [EMAIL PROTECTED]
  Subject: [BioC] R performance questions
  
  
  Hi, all--
  
  I wanted to start a thread on R speed/benchmarking.  There 
 is a nice R
  benchmarking overview at 
 http://www.sciviews.org/other/benchmark.htm,
  along with a 
 free script so you can see how your machine stacks up.
  
  Looks like R is substantially faster than S-plus.
  
  My problem is this: with 512Mb and an overclocked AMD 
 Athlon XP 1800+,
  running at 588 SPEC-FP 2000, it still takes FOREVER to 
  analyze multiple
  .cel files using affy (expresso).  Running svm takes a mighty 
  long time
  with more than 500 genes, 150 samples.
  
  Questions:
  1) Would adding RAM or processing speed improve performance 
 the most?
  2) Is it possible to run R on a cluster without rewriting my 
  high-level
  code?  In other words,
  3) What are we going to do when we start collecting 
 terabytes of array
  data to analyze?  There will come a breaking point at 
 which desktop
  systems can't perform these analyses fast enough for large 
  quantities of
  data.  What then?
  
  Michael Benjamin, MD
  Winship Cancer Institute
  Emory University,
  Atlanta, GA
  
  ___
  Bioconductor mailing list
  [EMAIL PROTECTED]
  https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
  
 
 
 


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] RE: R performance questions

2003-12-03 Thread A.J. Rossini
Michael Benjamin [EMAIL PROTECTED] writes:

 I was just looking ahead two or three years--where is all this genomic
 array research headed?  I guess I'm concerned about scalability.  

Me too -- but at least in the near future, data will be growing more
than the capacity to process it.

 Is anyone really working on implementing affy on a cluster/Beowulf?
 That sounds like a real challenge.

Yes and no.  Depends on which components you want to deal with, and
how you want to work with the data.   

Everything (with respect to speed/capacity/etc) is especially
contextual -- applications and approximations will be quite
important. 

best,
-tony

-- 
[EMAIL PROTECTED]http://www.analytics.washington.edu/ 
Biomedical and Health Informatics   University of Washington
Biostatistics, SCHARP/HVTN  Fred Hutchinson Cancer Research Center
UW (Tu/Th/F): 206-616-7630 FAX=206-543-3461 | Voicemail is unreliable
FHCRC  (M/W): 206-667-7025 FAX=206-667-4812 | use Email


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] add a point to regression line and cook's distance

2003-12-03 Thread Jason Turner
[EMAIL PROTECTED] wrote:

Hi, 

 My question is the following:
 I would like to have a robust regression line. The data I have are
 mostly clustered around a small range, so
 the regression line tends to be influenced strongly by outlier points
 (with large Cook's distance). From the application's
 background, I know that the line should pass through (0,0), which is far
 away from the data cloud. I would like to add this
 point to have a more robust line. The question is:
 does it make sense to do this? What are the negative impacts, if any?

Have you tried a more robust fit (ltsreg() in the package lqs springs to
mind)?  Using this, without forcing the intercept to zero, might give
you some idea whether your idea makes sense.  Venables and Ripley (Modern
Applied Statistics with S, Springer-Verlag, 2002) give a good
introduction to robust linear models and how to estimate their error
distribution.  Julian Faraway also gives an overview of the same in his
Practical Regression and ANOVA using R.
http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
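
A rough sketch with fabricated data (ltsreg() is in the lqs package of the
VR bundle; in later R versions it lives in MASS):

library(lqs)                   # assumed here; use library(MASS) in later R
x <- c(runif(50, 10, 12), 25)  # clustered data plus one far point
y <- 3 * x + rnorm(51)
y[51] <- 200                   # turn the far point into a gross outlier
ltsreg(y ~ x)                  # least trimmed squares resists the outlier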

Hope that helps

Jason
--
Indigo Industrial Controls Ltd.
http://www.indigoindustrial.co.nz
64-21-343-545
[EMAIL PROTECTED]
__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help