[R] Help with plotting kohonen maps

2010-11-22 Thread Stella Pachidi
Dear all,

I recently started using the kohonen package for my thesis project. I
have a very simple question which I cannot figure out by myself:

When I execute the following example code, from the paper of Wehrens
and Buydens (http://www.jstatsoft.org/v21/i05/paper):

R> library(kohonen)
Loading required package: class
R> data(wines)
R> wines.sc <- scale(wines)
R> set.seed(7)
R> wine.som <- som(data = wines.sc, grid = somgrid(5, 4, "hexagonal"))
R> plot(wine.som, main = "Wine data")

I get a plot of the codebook vectors of the 5-by-4 mapping of the wine
data, and it also shows which variable name corresponds to each color
(the same picture as in the paper).

However, when I run the som() function on my own data and try to plot
the result afterwards:

library(kohonen)
self_Organising_Map <- som(data = tableToCluster,
                           grid = somgrid(5, 2, "rectangular"), rlen = 1000)
plot(self_Organising_Map, main = "Kohonen Map of Clustered Profiles")

the resulting plot does not contain the color labels, i.e. the
variable names of my data table, even though they exist as column
names of tableToCluster.

I also tried the following line:

plot(self_Organising_Map, type = "codes", codeRendering = "segments",
     ncolors = length(colnames(self_Organising_Map$codes)),
     palette.name = rainbow,
     main = "Kohonen Map of Clustered Profiles \n Codes",
     zlim = colnames(self_Organising_Map$codes))

but it had the same result.

If you know which argument I should use to show the color labels in
the codes plot of the Kohonen map, please drop a line!
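
For reference, a minimal self-contained sketch along those lines;
tableToCluster is not shown in the post, so a random stand-in matrix is
used, and the working assumption (based on the wines example above) is
that the legend labels are taken from the column names of the codebook
matrix:

library(kohonen)

# Hypothetical stand-in for tableToCluster: a numeric matrix WITH column
# names, since the codes plot appears to take its legend labels from them.
tableToCluster <- matrix(rnorm(200), ncol = 5,
                         dimnames = list(NULL, paste("var", 1:5, sep = "")))

self_Organising_Map <- som(data = tableToCluster,
                           grid = somgrid(5, 2, "rectangular"), rlen = 1000)

# If this prints NULL, the names were lost somewhere upstream and the
# legend will be empty.  (In kohonen >= 3.0 the codes are stored as a
# list, so it would be colnames(self_Organising_Map$codes[[1]]).)
colnames(self_Organising_Map$codes)

plot(self_Organising_Map, type = "codes",
     main = "Kohonen Map of Clustered Profiles")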

Kind regards,
Stella

-- 
Stella Pachidi
Master in Business Informatics student
Utrecht University



[R] Help on aggregate method

2010-06-01 Thread Stella Pachidi
Dear R experts,

I would really appreciate any ideas on how to use the aggregate method
more efficiently:

More specifically, I would like to calculate the mean of certain
values in a data frame, grouped by various attributes, and then add a
new column to the data frame that holds the corresponding group mean
for every row. I attach part of my code:

matchMean <- function(ind, dataTable, aggrTable) {
  index <- which((aggrTable[, 1] == dataTable[["Attr1"]][ind]) &
                 (aggrTable[, 2] == dataTable[["Attr2"]][ind]))
  as.numeric(aggrTable[index, 3])
}

avgDur <- aggregate(ap.dat[["Dur"]], by = list(ap.dat[["Attr1"]],
                    ap.dat[["Attr2"]]), FUN = mean)
meanDur <- sapply(1:length(ap.dat[, 1]), FUN = matchMean, ap.dat, avgDur)
ap.dat <- cbind(ap.dat, meanDur)

As I deal with a very large data set, my matching function takes a
long time to run, so I would be really grateful for any idea on how to
speed up this matching process.
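
For what it's worth, a sketch of a vectorised alternative to the
per-row which() lookup; it assumes ap.dat has the columns Attr1, Attr2
and Dur used above, and relies on aggregate() naming its output columns
Group.1, Group.2 and x:

# Same aggregation as above.
avgDur <- aggregate(ap.dat[["Dur"]], by = list(ap.dat[["Attr1"]],
                    ap.dat[["Attr2"]]), FUN = mean)

# Build one composite key per row and look all rows up in a single
# vectorised match() instead of calling which() once per row.
rowKey <- interaction(ap.dat[["Attr1"]], ap.dat[["Attr2"]], drop = TRUE)
aggKey <- interaction(avgDur$Group.1, avgDur$Group.2, drop = TRUE)
ap.dat$meanDur <- avgDur$x[match(rowKey, aggKey)]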

Thank you very much in advance!

Kind regards,
Stella



--
Stella Pachidi
Master in Business Informatics student
Utrecht University



Re: [R] Help on aggregate method

2010-06-01 Thread Stella Pachidi
Dear Erik and R experts,

Thank you for the fast response!

I include an example with the ChickWeight dataset:

ap.dat <- ChickWeight

matchMeanEx <- function(ind, dataTable, aggrTable) {
  index <- which((aggrTable[, 1] == dataTable[["Diet"]][ind]) &
                 (aggrTable[, 2] == dataTable[["Chick"]][ind]))
  as.numeric(aggrTable[index, 3])
}

avgW <- aggregate(ap.dat[["weight"]], by = list(ap.dat[["Diet"]],
                  ap.dat[["Chick"]]), FUN = mean)
meanW <- sapply(1:length(ap.dat[, 1]), FUN = matchMeanEx, ap.dat, avgW)
ap.dat <- cbind(ap.dat, meanW)


Best regards,
Stella


On Tue, Jun 1, 2010 at 4:58 PM, Erik Iverson er...@ccbr.umn.edu wrote:

 It's easiest for us to help if you give us a reproducible example.  We
 don't have your datasets (ap.dat), so we can't run your code below. It's
 easy to create sample data with the random number generators in R, or use
 ?dput to give us a sample of your actual data.frame.

 I would guess your problem is solved by ?ave though.

 Stella Pachidi wrote:

 [original message trimmed]




-- 
Stella Pachidi
Master in Business Informatics student
Utrecht University
email: s.pach...@students.uu.nl
tel: +31644478898




Re: [R] Help on aggregate method

2010-06-01 Thread Stella Pachidi
Dear Erik,

Thank you very much. Indeed, ave() did the same job amazingly fast! I
did not know of this function before.

Many thanks to all the R experts who answer on this mailing list; it's
amazing how much help you offer to newbies :)

Kind regards,
Stella

On Tue, Jun 1, 2010 at 6:11 PM, Erik Iverson er...@ccbr.umn.edu wrote:



 Stella Pachidi wrote:

 [quoted example trimmed]



 How about simply using ave()?

 ap.dat$meanW <- ave(ap.dat$weight, list(ap.dat$Diet, ap.dat$Chick))
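
(A side note on the one-liner above: ave() returns a vector aligned
row-by-row with its first argument and FUN defaults to mean, so no
matching step is needed; other summaries can be swapped in via FUN. A
quick sketch with the same ChickWeight data:)

ap.dat <- ChickWeight

# group mean, already aligned with ap.dat's rows (FUN defaults to mean)
ap.dat$meanW <- ave(ap.dat$weight, list(ap.dat$Diet, ap.dat$Chick))

# the same mechanism with another summary, e.g. the group median
ap.dat$medianW <- ave(ap.dat$weight, list(ap.dat$Diet, ap.dat$Chick),
                      FUN = median)

head(ap.dat)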




-- 
Stella Pachidi
Master in Business Informatics student
Utrecht University
email: s.pach...@students.uu.nl
tel: +31644478898




[R] Question about difftime()

2010-05-20 Thread Stella Pachidi
Dear R experts,

I have a question about the result of the difftime() function: does it
take into account the different number of days in each month? In my
example, I have the following:

> firstDay
[1] "2010-02-20"
> lastDay
[1] "2010-05-20 16:00:00"
> difftime(lastDay, firstDay, units = 'days')
Time difference of 89.625 days


When I count the days myself I get 88 days from 20/02/2010 to
20/05/2010, so the difference in days should be 87.

On the contrary, difftime() gives a higher number, so I doubt whether
it takes into account the fact that February has 28 days (or 29).
Could you please help?
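
For what it's worth, a minimal check of the underlying day count, using
only the dates from the example above (the fractional .625, i.e. 15
hours, is consistent with the 16:00 time of day minus one hour lost at
the switch to summer time, assuming a timezone that observes it):

firstDay <- as.Date("2010-02-20")
lastDay  <- as.Date("2010-05-20")
as.numeric(lastDay - firstDay)
# 89: 8 days left in February (2010 is not a leap year)
#     + 31 (March) + 30 (April) + 20 (May)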

Thank you very much in advance.

Kind regards,
Stella

--
Stella Pachidi
Master in Business Informatics student
Utrecht University



Re: [R] Huge data sets and RAM problems

2010-04-22 Thread Stella Pachidi
Dear all,

Thank you very much for your replies and help. I will try to work with
your suggestions and will get back to you if I need anything more.

Kind regards,
Stella Pachidi

On Thu, Apr 22, 2010 at 5:30 AM, kMan kchambe...@gmail.com wrote:
 You set records to NULL perhaps (delete, shift up). Perhaps your system is
 susceptible to butterflies on the other side of the world.

 Your code may have 'worked' on a small section of data, but the data used
 did not include all of the cases needed to fully test your code. So... test
 your code!

 scan(), used with 'nlines', 'skip', 'sep', and 'what', will cut your read
 time by at least half while using less RAM to do it, do most of your post
 processing, and give you something to better test your code with. Or don't
 use 'nlines' and lose your time/memory benefits over read.table(). 'skip'
 will get you right to the point just before where things failed; that would
 be an interesting small segment of data to test with.
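
A rough sketch of that suggestion; the file name and the field types
are assumptions (the original post mentions 18 tab-separated
attributes), and 'skip' is set to the row count where read.delim()
stopped:

# Read a small window of rows around the point where read.delim() stopped.
cols  <- rep(list(character(0)), 18)    # 18 fields, all read as character
chunk <- scan("file.txt", what = cols, sep = "\t",
              skip = 1220987, nlines = 1000, quiet = TRUE)
str(chunk)   # inspect these rows for stray quotes, embedded tabs, etc.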

 WordPad can read your file (and then some). Eventually.

 Sincerely,
 KeithC.

 -----Original Message-----
 From: Stella Pachidi [mailto:stella.pach...@gmail.com]
 Sent: Monday, April 19, 2010 2:07 PM
 To: r-h...@stat.math.ethz.ch
 Subject: [R] Huge data sets and RAM problems

 [original message trimmed; the full post follows below]







-- 
Stella Pachidi
Master in Business Informatics student
Utrecht University
email: s.pach...@students.uu.nl
tel: +31644478898



[R] Huge data sets and RAM problems

2010-04-19 Thread Stella Pachidi
Dear all,

This is the first time I am sending mail to the mailing list, so I
hope I do not make a mistake...

For the last few months I have been working on my MSc thesis project,
applying data mining techniques to the user logs of a
software-as-a-service application. The main problem I am experiencing
is how to process the huge amount of data. More specifically:

I am using R 2.10.1 on a laptop with 32-bit Windows 7, 2 GB of RAM and
an Intel Core Duo 2 GHz CPU.

The user log data come from a Crystal Reports query (.rpt file), which
I transform with some Java code into a tab-separated file.

Although everything runs on a small subset of my data, when I increase
the data set I get several problems:

The first problem is with the use of read.delim(). When I try to read
a large amount of data (over 2,400,000 rows with 18 attributes each),
it does not seem to turn the whole table into a data frame: the
returned data frame has only 1,220,987 rows.

Furthermore, as one of the attributes is a date-time, when I try to
split this column into two columns (one with the date and one with the
time), the result is quite strange: the two new columns appear to have
more rows than the data frame:

applicLog.dat <- read.delim("file.txt")
# Process the syscreated column (date-time --> date + time)
copyDate <- applicLog.dat[["ï..syscreated"]]
copyDate <- as.character(copyDate)
splitDate <- strsplit(copyDate, " ")
splitDate <- unlist(splitDate)
splitDateIndex <- c(1:length(splitDate))
sysCreatedDate <- splitDate[splitDateIndex %% 2 == 1]
sysCreatedTime <- splitDate[splitDateIndex %% 2 == 0]
sysCreatedDate <- strptime(sysCreatedDate, format = "%Y-%m-%d")
op <- options(digits.secs = 3)
sysCreatedTime <- strptime(sysCreatedTime, format = "%H:%M:%OS")
applicLog.dat[["ï..syscreated"]] <- NULL
applicLog.dat <- cbind(sysCreatedDate, sysCreatedTime, applicLog.dat)

Then I get the error:

Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 1221063, 1221062, 1220987
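
The differing row counts suggest that a few rows do not contain exactly
one space in that column, so the odd/even trick on the unlist()ed
vector shifts everything after the first malformed entry. A sketch of a
per-row split that keeps the alignment (same column name as in the
snippet above):

# Split each value separately and keep one date and one time per input
# row; malformed rows become NA instead of shifting the rest.
parts <- strsplit(as.character(applicLog.dat[["ï..syscreated"]]),
                  " ", fixed = TRUE)
sysCreatedDate <- vapply(parts, function(p) p[1], character(1))
sysCreatedTime <- vapply(parts,
                         function(p) if (length(p) > 1) p[2]
                                     else NA_character_,
                         character(1))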


Finally, another problem occurs when I perform association mining on
the data set using the arules package: I turn the data frame into a
transactions table and then run the apriori algorithm. When I set the
support low enough to find the rules I need, the set of rules becomes
too big and I run into memory problems such as:
Error: cannot allocate vector of size 923.1 Mb
In addition: Warning messages:
1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)

Could you please help me with how to allocate more RAM? Or do you
think there is a way to process the data from disk instead of loading
them all into RAM? Do you know how I could manage to read my whole
data set?

I would really appreciate your help.

Kind regards,
Stella Pachidi

PS: Do you know any text editor that can read huge .txt files?





--
Stella Pachidi
Master in Business Informatics student
Utrecht University
