[R] Help with plotting kohonen maps
Dear all,

I recently started using the kohonen package for my thesis project. I have a very simple question which I cannot figure out by myself. When I execute the following example code from the paper of Wehrens and Buydens (http://www.jstatsoft.org/v21/i05/paper):

R> library(kohonen)
Loading required package: class
R> data(wines)
R> wines.sc <- scale(wines)
R> set.seed(7)
R> wine.som <- som(data = wines.sc, grid = somgrid(5, 4, "hexagonal"))
R> plot(wine.som, main = "Wine data")

I get a plot of the codebook vectors of the 5-by-4 mapping of the wine data, and it also shows which variable name corresponds to each color (the same picture as in the paper). However, when I run the som() function with my own data and try to produce the plot afterwards:

library(kohonen)
self_Organising_Map <- som(data = tableToCluster,
                           grid = somgrid(5, 2, "rectangular"), rlen = 1000)
plot(self_Organising_Map, main = "Kohonen Map of Clustered Profiles")

the resulting plot does not contain the color labels, i.e. the variable names of my data table, even though they exist and are included as column names of tableToCluster. I also tried the following line:

plot(self_Organising_Map, type = "codes", codeRendering = "segments",
     ncolors = length(colnames(self_Organising_Map$codes)),
     palette.name = rainbow,
     main = "Kohonen Map of Clustered Profiles \n Codes",
     zlim = colnames(self_Organising_Map$codes))

but it had the same result. If you could help with what argument I should use to show the color labels in the codes plot of the kohonen map, please drop a line!

Kind regards,
Stella

--
Stella Pachidi
Master in Business Informatics student
Utrecht University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
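One thing worth checking (a hedged guess, since tableToCluster itself is not shown in the post): plot.kohonen takes the segment labels from the column names of the codebook matrix, which are inherited from the matrix passed to som(). If the data go in as an unnamed numeric matrix, or the names are dropped during a conversion step, there is nothing for the plot to label. A minimal base-R sketch of that invariant, using made-up data:

```r
# Hypothetical stand-in for tableToCluster: som() expects a numeric
# matrix, and the segment labels come from its column names.
set.seed(7)
tableToCluster <- matrix(rnorm(200), ncol = 4)
colnames(tableToCluster) <- c("var1", "var2", "var3", "var4")

# A round trip through a data frame with as.matrix() keeps the names:
asFrame <- as.data.frame(tableToCluster)
backToMatrix <- as.matrix(asFrame)
stopifnot(identical(colnames(backToMatrix), colnames(tableToCluster)))

# With names in place, the trained map should carry them along, e.g.
#   self_Organising_Map <- som(data = tableToCluster,
#                              grid = somgrid(5, 2, "rectangular"))
#   colnames(self_Organising_Map$codes)
```

If colnames(self_Organising_Map$codes) comes back NULL, the labels were lost before plotting, not by plot() itself.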
[R] Help on aggregate method
Dear R experts,

I would really appreciate an idea on how to use the aggregate method more efficiently. More specifically, I would like to calculate the mean of certain values in a data frame, grouped by various attributes, and then create a new column in the data frame that holds the corresponding mean for every row. I attach part of my code:

matchMean <- function(ind, dataTable, aggrTable) {
    index <- which((aggrTable[, 1] == dataTable[["Attr1"]][ind]) &
                   (aggrTable[, 2] == dataTable[["Attr2"]][ind]))
    as.numeric(aggrTable[index, 3])
}

avgDur <- aggregate(ap.dat[["Dur"]],
                    by = list(ap.dat[["Attr1"]], ap.dat[["Attr2"]]),
                    FUN = mean)
meanDur <- sapply(1:length(ap.dat[, 1]), FUN = matchMean, ap.dat, avgDur)
ap.dat <- cbind(ap.dat, meanDur)

As I deal with a very large data set, my matching function takes a long time to run, so if you have an idea on how to speed up this matching process I would be really grateful. Thank you very much in advance!

Kind regards,
Stella

--
Stella Pachidi
Master in Business Informatics student
Utrecht University
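The per-row sapply() above does a full search of the aggregate table for every row, which is what makes it slow on large data. One vectorised alternative is to name the grouping columns in aggregate() and join the result back with merge(). A sketch under stated assumptions: ap.dat is not available, so ChickWeight stands in for it, and the column names below belong to that substitute, not to the original data:

```r
# ChickWeight as a stand-in for the unavailable ap.dat:
ap.dat <- ChickWeight

# Group means, with the grouping columns named so merge() can use them:
avgW <- aggregate(ap.dat[["weight"]],
                  by = list(Diet = ap.dat[["Diet"]], Chick = ap.dat[["Chick"]]),
                  FUN = mean)
names(avgW)[3] <- "meanW"

# One join instead of one search per row:
ap.dat <- merge(ap.dat, avgW, by = c("Diet", "Chick"))
```

One caveat: merge() may reorder the rows, so if the original row order matters it has to be restored afterwards (e.g. via a row-index column).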
Re: [R] Help on aggregate method
Dear Erik and R experts,

Thank you for the fast response! I include an example with the ChickWeight dataset:

ap.dat <- ChickWeight

matchMeanEx <- function(ind, dataTable, aggrTable) {
    index <- which((aggrTable[, 1] == dataTable[["Diet"]][ind]) &
                   (aggrTable[, 2] == dataTable[["Chick"]][ind]))
    as.numeric(aggrTable[index, 3])
}

avgW <- aggregate(ap.dat[["weight"]],
                  by = list(ap.dat[["Diet"]], ap.dat[["Chick"]]),
                  FUN = mean)
meanW <- sapply(1:length(ap.dat[, 1]), FUN = matchMeanEx, ap.dat, avgW)
ap.dat <- cbind(ap.dat, meanW)

Best regards,
Stella

On Tue, Jun 1, 2010 at 4:58 PM, Erik Iverson <er...@ccbr.umn.edu> wrote:
> It's easiest for us to help if you give us a reproducible example. We
> don't have your data sets (ap.dat), so we can't run your code below.
> It's easy to create sample data with the random number generators in R,
> or use ?dput to give us a sample of your actual data.frame. I would
> guess your problem is solved by ?ave, though.

--
Stella Pachidi
Master in Business Informatics student
Utrecht University
email: s.pach...@students.uu.nl
tel: +31644478898
Re: [R] Help on aggregate method
Dear Erik,

Thank you very much. Indeed ave did the same job amazingly fast! I did not know the function before. Many thanks to all the R experts who answer on this mailing list; it's amazing how much help you offer to the newbies :)

Kind regards,
Stella

On Tue, Jun 1, 2010 at 6:11 PM, Erik Iverson <er...@ccbr.umn.edu> wrote:
> How about simply using ave?
>
> ap.dat$meanW <- ave(ap.dat$weight, list(ap.dat$Diet, ap.dat$Chick))

--
Stella Pachidi
Master in Business Informatics student
Utrecht University
email: s.pach...@students.uu.nl
tel: +31644478898
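For readers landing on this thread later: ave() returns a vector as long as its input, with each element replaced by its group's statistic, so it can be assigned straight back as a new column with no matching step at all. A quick self-contained check on ChickWeight:

```r
cw <- ChickWeight

# One call: every row gets the mean weight of its (Diet, Chick) group.
cw$meanW <- ave(cw$weight, cw$Diet, cw$Chick, FUN = mean)

# Spot-check one group against a direct computation:
grp <- cw$Diet == "1" & cw$Chick == "1"
stopifnot(abs(cw$meanW[grp][1] - mean(cw$weight[grp])) < 1e-9)
```

Because the result keeps the original row order, there is no need for the merge-and-reorder dance that a join-based solution requires.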
[R] Question about difftime()
Dear R experts,

I have a question about the result of the difftime() function: does it take into account the different number of days in each month? In my example, I have the following:

> firstDay
[1] "2010-02-20"
> lastDay
[1] "2010-05-20 16:00:00"
> difftime(lastDay, firstDay, units = 'days')
Time difference of 89.625 days

When I count the days I get 88 days from 20/02/2010 to 20/05/2010, so the difference in days should be 87. On the contrary, difftime gives a higher number, so I doubt whether it takes into account the fact that February has 28 days (or 29). Could you please help?

Thank you very much in advance.

Kind regards,
Stella

--
Stella Pachidi
Master in Business Informatics student
Utrecht University
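For anyone hitting this later: difftime() subtracts actual time points, so month lengths are accounted for automatically. Counting on a calendar, 20 Feb to 20 May 2010 is 28 + 31 + 30 = 89 whole days, and the fraction comes from the 16:00 clock time. (That the post shows 89.625 rather than 89.667 suggests, as an assumption, that the objects live in a time zone that switched to daylight-saving time between the two dates, shortening the interval by one hour.) A check pinned to UTC, where no DST applies:

```r
# Whole calendar days between the dates, month lengths included:
dayGap <- as.Date("2010-05-20") - as.Date("2010-02-20")
stopifnot(as.numeric(dayGap) == 28 + 31 + 30)   # 89 days, not 87 or 88

# The same interval as instants, fixed to UTC so no DST hour is lost:
firstDay <- as.POSIXct("2010-02-20 00:00:00", tz = "UTC")
lastDay  <- as.POSIXct("2010-05-20 16:00:00", tz = "UTC")
gap <- as.numeric(difftime(lastDay, firstDay, units = "days"))
stopifnot(abs(gap - (89 + 16/24)) < 1e-9)       # 89 days + 16 hours
```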
Re: [R] Huge data sets and RAM problems
Dear all,

Thank you very much for your replies and help. I will try to work with your suggestions and come back to you if I need something more.

Kind regards,
Stella Pachidi

On Thu, Apr 22, 2010 at 5:30 AM, kMan <kchambe...@gmail.com> wrote:
> You set records to NULL perhaps (delete, shift up). Perhaps your system
> is susceptible to butterflies on the other side of the world. Your code
> may have 'worked' on a small section of data, but the data used did not
> include all of the cases needed to fully test your code. So... test
> your code!
>
> scan(), used with 'nlines', 'skip', 'sep', and 'what', will cut your
> read time by at least half while taking less RAM to do it, do most of
> your post processing, and give you something to better test your code.
> Or, don't use 'nlines' and lose your time/memory benefits over
> read.table(). 'skip' will get you right to the point before where
> things failed. That would be an interesting small segment of data to
> test with.
>
> wordpad can read your file (and then some). Eventually.
>
> Sincerely,
> KeithC.

--
Stella Pachidi
Master in Business Informatics student
Utrecht University
email: s.pach...@students.uu.nl
tel: +31644478898
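kMan's scan() suggestion can be sketched concretely. Nothing below comes from the original data: the file, column layout, and chunk size are invented; the point is only the 'skip'/'nlines'/'what' pattern, which keeps a single chunk in RAM at a time while fixing column types up front:

```r
# Small stand-in for the real tab-separated log file:
logFile <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:10, value = rnorm(10)),
            logFile, sep = "\t", row.names = FALSE, col.names = FALSE)

chunkSize <- 4
skip <- 0
total <- 0
repeat {
    # 'what' declares the column types; 'skip'/'nlines' bound the read.
    chunk <- scan(logFile, what = list(id = integer(), value = double()),
                  sep = "\t", skip = skip, nlines = chunkSize, quiet = TRUE)
    n <- length(chunk$id)
    if (n == 0) break
    total <- total + n      # process the chunk here, then let it go
    skip <- skip + n
    if (n < chunkSize) break
}
stopifnot(total == 10)
```

Because 'skip' advances in known steps, the chunk that triggers a failure can be re-read in isolation, which is exactly the small test segment kMan describes.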
[R] Huge data sets and RAM problems
Dear all,

This is the first time I am sending mail to the mailing list, so I hope I do not make a mistake...

The last months I have been working on my MSc thesis project, performing data mining techniques on user logs of a software-as-a-service application. The main problem I am experiencing is how to process the huge amount of data. More specifically:

I am using R 2.10.1 on a laptop with Windows 7 32-bit, 2GB RAM and an Intel Core Duo 2GHz CPU. The user log data come from a query Crystal report (.rpt file) which I transform with some Java code into a tab-separated file. Although with a small subset of my data everything manages to run, when I increase the data set I get several problems.

The first problem is with the use of read.delim(). When I try to read a big amount of data (over 2,400,000 rows with 18 attributes each) it does not seem to transform the whole table into a data frame. In particular, the data frame returned has 1,220,987 rows. Furthermore, as one of the data attributes is a date-time, when I try to split this column into two columns (one with the date and one with the time), the returned result is quite strange, as the two new columns appear to have more rows than the data frame:

applicLog.dat <- read.delim("file.txt")

# Process the syscreated column (date time --> date + time)
copyDate <- applicLog.dat[["ï..syscreated"]]
copyDate <- as.character(copyDate)
splitDate <- strsplit(copyDate, " ")
splitDate <- unlist(splitDate)
splitDateIndex <- c(1:length(splitDate))
sysCreatedDate <- splitDate[splitDateIndex %% 2 == 1]
sysCreatedTime <- splitDate[splitDateIndex %% 2 == 0]
sysCreatedDate <- strptime(sysCreatedDate, format = "%Y-%m-%d")
op <- options(digits.secs = 3)
sysCreatedTime <- strptime(sysCreatedTime, format = "%H:%M:%OS")
applicLog.dat[["ï..syscreated"]] <- NULL
applicLog.dat <- cbind(sysCreatedDate, sysCreatedTime, applicLog.dat)

Then I get the error:

Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 1221063, 1221062, 1220987

Finally, another problem occurs when I perform association mining on the data set using the package arules: I turn the data frame into a transactions table and then run the apriori algorithm. When I set the support too low (in order to find the rules I need), the vector of rules becomes too big and I get memory problems such as:

Error: cannot allocate vector of size 923.1 Mb
In addition: Warning messages:
1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)

Could you please help me with how I could allocate more RAM? Or, do you think there is a way to process the data by loading them into a document instead of loading everything into RAM? Do you know how I could manage to read my whole data set?

I would really appreciate your help.

Kind regards,
Stella Pachidi

PS: Do you know any text editor that can read huge .txt files?

--
Stella Pachidi
Master in Business Informatics student
Utrecht University
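On the row-count mismatch specifically: splitting on a single space and taking alternating elements silently breaks as soon as one entry contains an extra space or is empty, which would produce exactly the kind of off-by-a-few discrepancy reported (1221063 vs 1221062 vs 1220987). A more robust sketch (the timestamp values below are invented; the real column is the ï..syscreated one from the post) parses the combined stamp in one strptime() call and derives both columns from the same parsed vector, so their lengths cannot drift apart:

```r
# Invented sample stamps standing in for the real syscreated column:
syscreated <- c("2010-04-19 14:07:01.250",
                "2010-04-20 09:30:00.125",
                "2010-04-21 23:59:59.999")

op <- options(digits.secs = 3)
# Parse date and time together; a malformed row becomes NA here
# instead of silently shifting every later element by one.
created <- strptime(syscreated, format = "%Y-%m-%d %H:%M:%OS")

# Both derived columns come from the one parsed vector:
sysCreatedDate <- as.Date(created)
sysCreatedTime <- format(created, "%H:%M:%OS")

stopifnot(length(sysCreatedDate) == length(syscreated),
          length(sysCreatedTime) == length(syscreated))
options(op)
```

Malformed input then shows up as NA values that can be counted and inspected, rather than as a cbind() error two steps later.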