Re: [R] Using split() several times in a row?

2007-03-31 Thread Martin Maechler
 SteT == Stephen Tucker [EMAIL PROTECTED]
 on Fri, 30 Mar 2007 18:41:39 -0700 (PDT) writes:

  [..]

SteT For dates, I usually store them as POSIXct classes
SteT in data frames, but according to Gabor Grothendieck
SteT and Thomas Petzoldt's R Help Desk article
SteT http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf,
SteT I should probably be using chron date and times...

I don't think you should (and I doubt Gabor and Thomas would
recommend this in every case):

POSIXct (and 'POSIXlt', 'POSIXt' & 'Date') are part of standard R,
and whereas they may not seem as convenient in all cases as chron
etc, I'd rather recommend sticking to them in such a case.

SteT Nonetheless, POSIXct classes are what I know so I can
SteT show you that to get the month out of your column
SteT (replace "8.29.97" with your variable), you can do the
SteT following:

SteT month = format(strptime("8.29.97",format="%m.%d.%y"),format="%m")

SteT Or,
SteT month = as.data.frame(strsplit("8.29.97","\\."))[1,]

  [.. etc .. very useful .. advice]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Using split() several times in a row?

2007-03-31 Thread Gabor Grothendieck
On 3/31/07, Martin Maechler [EMAIL PROTECTED] wrote:
  SteT == Stephen Tucker [EMAIL PROTECTED]
  on Fri, 30 Mar 2007 18:41:39 -0700 (PDT) writes:

  [..]

SteT For dates, I usually store them as POSIXct classes
SteT in data frames, but according to Gabor Grothendieck
SteT and Thomas Petzoldt's R Help Desk article
SteT http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf,
SteT I should probably be using chron date and times...

 I don't think you should (and I doubt Gabor and Thomas would
 recommend this in every case):

 POSIXct (and 'POSIXlt', 'POSIXt' & 'Date') are part of standard R,
 and whereas they may not seem as convenient in all cases as chron
 etc, I'd rather recommend sticking to them in such a case.

There is one change that has occurred since the article that, in my
mind, would let you safely use POSIX, but it's pretty drastic.  At the time
of the article you could not set the time zone to GMT in the R process
on Windows, but now you can do this:

Sys.putenv(TZ = "GMT")

and you can also change it back like this:

Sys.putenv(TZ = "")

The problem is that you can never be sure which time zone a time is
interpreted in within various functions (although you can be pretty
sure it's either the local time zone or GMT).  By setting the process to
GMT you make the two alternatives the same, so it no longer matters.
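
A minimal sketch of the point above, not taken from the article: the same
instant can fall on different calendar dates depending on the zone used to
display it, and pinning the process to GMT makes the "local" and "GMT"
readings agree. It assumes a current R, where the call is Sys.setenv()
rather than Sys.putenv(), and an installed time zone database that knows
"Asia/Tokyo"; the example timestamp is made up.

```r
# One instant in time, stored as POSIXct in GMT
x <- as.POSIXct("2007-03-31 23:30:00", tz = "GMT")

# The calendar date depends on the zone used for display
day_gmt   <- format(x, "%Y-%m-%d", tz = "GMT")         # "2007-03-31"
day_tokyo <- format(x, "%Y-%m-%d", tz = "Asia/Tokyo")  # "2007-04-01"

# Pin the process to GMT, remembering the previous setting
old_tz <- Sys.getenv("TZ", unset = NA)
Sys.setenv(TZ = "GMT")

# With TZ = GMT, the local-time and GMT readings of a string coincide
local_read   <- as.POSIXct("2007-03-31 23:30:00")
gmt_read     <- as.POSIXct("2007-03-31 23:30:00", tz = "GMT")
same_instant <- as.numeric(local_read) == as.numeric(gmt_read)  # TRUE

# Restore the earlier setting
if (is.na(old_tz)) Sys.unsetenv("TZ") else Sys.setenv(TZ = old_tz)
```

Saving and restoring TZ keeps the change local to the code that needs it,
which softens the "drastic" side of the approach.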

Short of the above, the recommendations of the article should be followed.
It's not a matter of convenience.  It's a matter of being error prone
and introducing subtle time-zone related errors into your code which are
very hard to track down or, worse, to even realize that you have.

Those who claim that it's not a problem simply have not used dates and times
enough or they would not say that.  I have seen posters make such comments
on this list only to run into subtle time zone problems later that they never
would have had if they had followed the advice in the article.

I've used R and dates a lot and therefore have made a lot of programming errors
and these recommendations come from bitter experience looking back to see
how I could have avoided them.



Re: [R] Using split() several times in a row?

2007-03-30 Thread Stephen Tucker
Hi Sergey,

I believe the code below should get you close to what you want.

For dates, I usually store them as POSIXct classes in data frames, but
according to Gabor Grothendieck and Thomas Petzoldt's R Help Desk article
http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf, I should probably be
using chron date and times...

Nonetheless, POSIXct classes are what I know so I can show you that to get the
month out of your column (replace "8.29.97" with your variable), you can do
the following:

month = format(strptime("8.29.97",format="%m.%d.%y"),format="%m")

Or,
month = as.data.frame(strsplit("8.29.97","\\."))[1,]
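
Both one-liners extend to a whole column. The following is a small sketch
under assumptions: the vector `dates` is made-up sample data in the same
m.d.yy layout, and the base Date class is swapped in for strptime() because
the values carry no time of day. Note that the two routes differ in zero
padding.

```r
# Hypothetical column of dates in the m.d.yy layout from the question
dates <- c("8.29.97", "9.30.97", "12.31.97")

# Parse once into the base Date class (no time of day, no time zone)
d <- as.Date(dates, format = "%m.%d.%y")

month_padded <- format(d, "%m")           # "08" "09" "12"
month_number <- as.integer(month_padded)  # 8 9 12

# String-only alternative: take the piece before the first dot
month_string <- sapply(strsplit(dates, ".", fixed = TRUE), `[`, 1)  # "8" "9" "12"
```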

In any case, here is some code in which I follow a series of function
definitions and applications (which effectively includes successive
applications of split() and lapply()).
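
Stripped to its core, the split-then-split idea can be sketched as follows.
The data frame `toy` and its columns are made up for illustration. The
first split() returns a list of data frames, so lapply() can run the second
split() inside each piece; split() also accepts a list of factors, which
does both levels in one call.

```r
# Hypothetical toy data: 2 months x 2 groups
toy <- data.frame(
  month = rep(c("1", "2"), each = 4),
  fac   = factor(rep(c("a", "b"), times = 4)),
  y     = 1:8
)

# First split gives a list of data frames, one per month
by_month <- split(toy, toy$month)

# lapply() runs the second split inside each month
nested <- lapply(by_month, function(d) split(d, d$fac))
nested[["2"]][["b"]]$y          # 6 8

# Alternative: one call with a list of factors
flat <- split(toy, list(toy$month, toy$fac))
names(flat)                     # "1.a" "2.a" "1.b" "2.b"
```

The nested form keeps the month/group hierarchy; the flat form is easier to
pass straight to lapply() or sapply().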

Best regards,

ST

# define data (I just made this up)
df <- data.frame(month=as.character(rep(1:3,each=30)),
                 fac=factor(rep(1:2,each=15)),
                 data1=round(runif(90),2),
                 data2=round(runif(90),2))

# define functions to split the data and another
# to get statistics
doSplits <- function(df) {
  # split by month, then by fac within each month; flatten to one list
  unlist(lapply(split(df,df$month),function(x) split(x,x$fac)),
         recursive=FALSE)
}
getStats <- function(x,f) {
  # apply f to the numeric, non-factor columns only
  keep <- unlist(lapply(x,mode))=="numeric" &
          unlist(lapply(x,class))!="factor"
  return(as.data.frame(lapply(x[keep],f)))
}
# create a matrix of data, means, and standard deviations
listMatrix <- cbind(Data=doSplits(df),
                    Means=lapply(doSplits(df),getStats,mean),
                    SDs=lapply(doSplits(df),getStats,sd))

# function to subtract means and divide by standard deviations
transformData <- function(x) {
  newdata <- x$Data
  matchedNames <- match(names(x$Means),names(x$Data))
  newdata[matchedNames] <-
    sweep(sweep(data.matrix(x$Data[matchedNames]),2,unlist(x$Means),"-"),
          2,unlist(x$SDs),"/")
  return(newdata)
}
# apply to data
newDF <- lapply(as.data.frame(t(listMatrix)),transformData)

# Define Fold function
# (returns x explicitly: a for loop by itself yields NULL in current R)
Fold <- function(f, x, L) { for(e in L) x <- f(x, e); x }
# Apply this to the data
finalData <- Fold(rbind,vector(),newDF)
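
The code above standardizes each column within every month/fac group. For
the ratio described in the original question (quoted below), a more direct
sketch might look like this. Everything here is an assumption made for
illustration: the data frame `toy`, the column names `fractile`, `v1` and
`v2`, and the reading of val2 as the standard deviation over all rows of
the month.

```r
# Hypothetical toy data: 3 months, fractiles 1-5, two numeric variables
set.seed(1)
toy <- data.frame(
  month    = rep(c("1", "2", "3"), each = 50),
  fractile = factor(rep(rep(1:5, each = 10), times = 3)),
  v1       = runif(150),
  v2       = runif(150)
)

num_cols <- c("v1", "v2")

# For one month's rows:
# (mean in fractile 1 - mean in fractile 5) / sd over the whole month
spread_one_month <- function(d) {
  by_frac <- split(d[num_cols], d$fractile)   # second split, inside the first
  val1 <- colMeans(by_frac[["1"]]) - colMeans(by_frac[["5"]])
  val2 <- sapply(d[num_cols], sd)
  val1 / val2
}

# First split by month, then apply; rows = months, columns = variables
result <- t(sapply(split(toy, toy$month), spread_one_month))
```

The result is a small matrix with one row per month, which is usually
easier to inspect than a list of per-group data frames.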






--- Sergey Goriatchev [EMAIL PROTECTED] wrote:

 Hi, fellow R users.
 
 I have a question about sapply and split combination.
 
 I have a big dataframe (4 observations, 21 variables). First
 variable (factor) is date and it is in format 8.29.97, that is, I
 have monthly data. Second variable (also factor) has levels 1 to 6
 (fractiles 1 to 5 and missing value with code 6). The other 19
 variables are numeric.
 For each month I have several hundred observations of 19 numeric variables
 and 1 factor.
 
 I am normalizing the numeric variables by dividing val1 by val2, where:
 
 val1: (for each month, for each numeric variable) difference between
 mean of ith numeric variable in fractile 1, and mean of ith numeric
 variable in fractile 5.
 
 val2: (for each month, for each numeric variable) standard deviation
 for ith numeric variable.
 
 Basically, as far as I understand, I need to use split() function several
 times.
 To calculate val1 I need to use split() twice - first to split by
 month and then split by fractile. Is this even possible to do (since
 after first application of split() I get a list)??
 
 Is there a smart way to perform this normalization computation?
 
 My knowledge of R is not so advanced, but I need to know an efficient
 way to perform calculations of this kind.
 
 Would really appreciate some help from experienced R users!
 
 Regards,
 S
 
 -- 
 Laziness is nothing more than the habit of resting before you get tired.
 - Jules Renard (writer)
 
 Experience is one thing you can't get for nothing.
 - Oscar Wilde (writer)
 
 When you are finished changing, you're finished.
 - Benjamin Franklin (Diplomat)
 

