Re: [R] Forming Portfolios for Fama / French Regression
Kai,

Your question is best addressed to r-sig-fina...@stat.math.ethz.ch, as it is a finance-related question.

Jude
Re: [R] Bhattacharyya distance metric
The Bhattacharyya distance is different from the Mahalanobis distance. See:
http://en.wikipedia.org/wiki/Bhattacharyya_distance

There are also the Hellinger distance and the Rao distance. For the Rao distance, see:
http://www.scholarpedia.org/article/Fisher-Rao_metric

Jude
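As a concrete illustration (my own addition, not part of the original reply): for two multivariate normal distributions the Bhattacharyya distance has a closed form, and a small R function for it might look like the sketch below. The function name and the toy inputs are made up for illustration.

# Bhattacharyya distance between N(mu1, S1) and N(mu2, S2):
#   D_B = 1/8 * t(mu1 - mu2) %*% solve(S) %*% (mu1 - mu2)
#         + 1/2 * log( det(S) / sqrt(det(S1) * det(S2)) ),  where S = (S1 + S2) / 2
bhattacharyya <- function(mu1, S1, mu2, S2) {
  S <- (S1 + S2) / 2
  d <- mu1 - mu2
  as.numeric(t(d) %*% solve(S) %*% d / 8 +
             0.5 * log(det(S) / sqrt(det(S1) * det(S2))))
}
# toy example: two bivariate normals with different means and covariances
bhattacharyya(c(0, 0), diag(2), c(1, 1), 2 * diag(2))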
Re: [R] compute differences
Alessandro Carletti wrote:

Hi, I have a problem. I have a data frame looking like:

ID val
A  0.3
B  1.2
C  3.4
D  2.2
E  2.0

I need to CREATE the following TABLE:

CASE  DIFF
A-A    0
A-B   -0.9
A-C   -3.1
A-D   -1.9
A-E   -1.7
B-A   ...
B-B   ...
B-C
B-D
B-E
C-A
...

where CASE is the pair of elements considered and DIFF is the computed difference between their values. Could you give me suggestions?

Solution:

Besides the suggestions given by others, you can use the sqldf package to do this (leveraging knowledge of SQL if you know SQL). If you join your data frame with itself, without a join condition, you will get the Cartesian product of the two data frames, which seems to be exactly what you need. A warning is in order: generally, when you join 2 (or more) data frames you do NOT want the Cartesian product but want to join the data frames by some key. The solution to your particular problem, however, can be implemented easily using the Cartesian product.

mydata <- data.frame(id = rep(c('A','B','C','D','E'), each = 2), val = sample(1:5, 10, replace = T))
mydata
library(sqldf)
# merge the data frame with itself to create a Cartesian product - this is normally NOT what you want.
# Note 'case' is a keyword in SQL, so I use cases for the variable name. Likewise diff is a function in R, so I use diffr.
mydata2 <- sqldf("select a.id as id1, a.val as val1, b.id as id2, b.val as val2,
                         a.id || ' - ' || b.id as cases, a.val - b.val as diffr
                  from mydata a, mydata b")
dim(mydata2)  # check dimensions of the merged dataset
head(mydata2) # examine the first 6 records
# if you want only the columns cases and diffr, then use this SQL code
mydata3 <- sqldf("select a.id || ' - ' || b.id as cases, a.val - b.val as diffr
                  from mydata a, mydata b")
dim(mydata3)  # check dimensions of the merged dataset
head(mydata3) # examine the first 6 records

Hope this helps.

Jude
Re: [R] compute differences
Thanks Petr! It is good to see multiple solutions to the same problem.

Best,

Jude

-----Original Message-----
From: Petr PIKAL [mailto:petr.pi...@precheza.cz]
Sent: Wednesday, September 23, 2009 10:59 AM
To: Ryan, Jude
Cc: alxmil...@yahoo.it; r-help@r-project.org
Subject: Re: [R] compute differences

Hi

You can use outer. If your data are in a data frame test, then

DIFF <- as.vector(t(outer(test$val, test$val, "-")))

returns a vector; you just need to add suitable names to the rows.

CASE <- as.vector(t(outer(test$ID, test$ID, paste, sep="-")))
data.frame(CASE, DIFF)

will put it together.

Regards
Petr
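To see Petr's outer() approach end to end, here is a minimal check of my own using the values from Alessandro's example (the data frame test below is my construction, not part of the thread):

test <- data.frame(ID = c("A", "B", "C", "D", "E"),
                   val = c(0.3, 1.2, 3.4, 2.2, 2.0))
DIFF <- as.vector(t(outer(test$val, test$val, "-")))
CASE <- as.vector(t(outer(as.character(test$ID), as.character(test$ID), paste, sep = "-")))
head(data.frame(CASE, DIFF))   # A-A 0, A-B -0.9, A-C -3.1, A-D -1.9, A-E -1.7, B-A 0.9

The transpose before as.vector() is what makes the pairs come out in the A-A, A-B, A-C, ... order requested in the question.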
Re: [R] Recursive partitioning algorithms in R vs. alia
Thanks for your point of view Terry! It is always fascinating to follow the history of the field, especially as told by someone involved with it.

Jude Ryan

-----Original Message-----
From: Terry Therneau [mailto:thern...@mayo.edu]
Sent: Tuesday, June 23, 2009 9:22 AM
To: Ryan, Jude; c...@datanalytics.com
Cc: r-help@r-project.org
Subject: Re: [R] Recursive partitioning algorithms in R vs. alia

A point of history: Both the commercial CART program and the rpart() function are based on the book Classification and Regression Trees (Breiman, Friedman, Olshen, Stone, 1984). As a reader/commentator on one of the early drafts I got to know the material well.

CART started as a large Fortran program written by Jerry Friedman which was the testing ground for the ideas in the book. I had the code at one time and made some modifications to it, but found it too frustrating to go very far with. Fortran is just too clumsy for a recursive task, and Jerry's ability to hold umpteen variables in his head at once is greater than mine -- the Fortran was a large monolithic block. Salford Systems acquired rights to that code; I don't know whether any of the original lines remain in their product. I had lots of conversations with their main programmer (15-20 years ago now) about methods for speeding it up; mainly an interesting problem in optimal indexing.

When rpart was first written its output agreed with CART almost entirely. The only major difference was in surrogates: I pick the surrogate with the largest number of agreements; CART picked the one with the greatest % agreement. This means that rpart favors variables with fewer missing values. Since that point in time both codes have evolved. I haven't had time to do important work on rpart in over a decade. It's not surprising that the graphics and display are behind the curve; what's more surprising is that it still endures.

Rpart is called rpart because the authors copyrighted the term CART for their program. It was the best alternative name that I could come up with at the time. I find it amusing that one consequence of their copyright choice is that I now see "recursive partitioning" far more often than CART as the generic label for tree-based methods.

Terry T
Re: [R] Recursive partitioning algorithms in R vs. alia
I have used all 3 packages for decision trees (SAS/EM, CART and R). As another user on the list commented, the algorithms CART uses are proprietary. I also know that since the algorithms are proprietary, the decision tree that you get from SAS is based on a slightly different algorithm so as not to violate copyright laws. When I first started using R (rpart) I benchmarked it (in terms of results obtained) for my particular problem at the time against Salford Systems' CART. R gave me an identical tree, with the splitting value being different in the 2nd or 3rd decimal place from what I recall. I did not have SAS/EM at that particular company and so could not benchmark it. Salford Systems' CART does have additional types of splitting criteria such as twoing etc., but again, these may be of value in certain types of problems. The splitting criteria found in R are good enough.

I do have SAS/EM right now but prefer R to SAS/EM since R can be programmed and SAS/EM cannot. This may not be relevant for decision trees, but for neural networks, for example, if I want to build hundreds of neural networks (since there are no variable selection methods for neural networks) with different predictors and different numbers of neurons, I can do this easily in R but cannot do this in SAS/EM. SAS/EM does have a variable selection node, but that is independent of the neural network node, so, from what I understand, you have to select the variables and then pass them to the neural network node.

In general, you get prettier output with CART and SAS/EM for trees. However, there are packages in R that can give you prettier output than rpart does. One GUI that you may want to explore, that works with R, is Rattle. This builds trees, neural networks, boosting, etc., and you can see the generated R code as well. In terms of handling large volumes of data, SAS/EM is probably the best. However, if you have a 64-bit operating system with lots of RAM, and use random sampling, R should suffice. It is debatable whether the extra features like pretty output and variable importance are worth the huge costs you have to pay for those products, unless you really need these features. With R you can do what you want, and that is build a good tree. From what I have read, variable importance measures can be biased as they are affected by factors such as multicollinearity, variables with many categories, etc., so their usefulness is questionable (however, end-users may love them).

SAS/EM is by far the most expensive product, and Salford Systems' CART is pretty expensive as well. So depending on your needs, R may be good enough or the best, because you can program it, and the latest methodologies will always be implemented in R first. For comparisons of the programming capabilities of SAS (macros) versus R you may want to look at what Frank Harrell and Terry Therneau (who wrote rpart) have to say. Both are experts in SAS and R.

Hope this helps.

Jude

Carlos wrote:

Dear R-helpers,

I had a conversation with a guy working in a business intelligence department at a major Spanish bank. They rely on recursive partitioning methods to rank customers according to certain criteria. They use both SAS EM and Salford Systems' CART. I have used package rpart in the past, but I could not provide any kind of feature comparison or the like, as I have no access to any installation of the first two proprietary products. Has anybody experience with them? Is there any public benchmark available?
Is there any very good --although solely technical-- reason to pay hefty software licences? How would the algorithms implemented in rpart compare to those in SAS and/or CART?

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com
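To illustrate the point above about R being programmable (building many neural networks in a loop), here is a minimal sketch of my own; the data frame names train and valid and the response dep_var are hypothetical and not from the thread:

library(nnet)
sizes <- 1:10
valid_mse <- numeric(length(sizes))
for (s in sizes) {
  # one model per hidden-layer size; decay and maxit are chosen only for illustration
  fit <- nnet(dep_var ~ ., data = train, size = s, decay = 1e-3,
              linout = TRUE, maxit = 1000, trace = FALSE)
  valid_mse[s] <- mean((valid$dep_var - predict(fit, valid))^2)
}
data.frame(size = sizes, valid_mse)   # compare hidden-layer sizes on a holdout sample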
Re: [R] Problem in 'Apply' function: does anybody have other solution
David Winsemius' solution:

apply(data.matrix(df), 1, I)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    5    6    7    8    9    10
y    1    3    4    5    6    7    8    9   10     2

For y and [,2] above the value is 3. Why is the value not 2? It looks like the value is 2 for y and [,10] (this should be 10, right?) and values 3 to 10 are shifted one position to the left for y. I got the same results when I ran this code.

Thanks,

Jude

David Winsemius wrote:

On Jun 17, 2009, at 9:27 AM, jim holtman wrote:

Do an 'str' of your object. It looks like one of the columns is probably character/factor since there are quotes around the 'numbers'. You can also explicitly convert the offending columns to numeric if you want to. Also use colClasses on the read.csv to define the class of the data in each column. This will show you where the error is.

One function that might be of use is data.matrix, which will attempt to convert character vectors to numeric vectors across an entire data frame. I hope this is not beating a dead horse, but see if these examples are helpful in any way:

?data.matrix
df <- data.frame(x = 1:10, y = as.character(1:10))
df
    x  y
1   1  1
2   2  2
3   3  3
4   4  4
5   5  5
6   6  6
7   7  7
8   8  8
9   9  9
10 10 10
# not all is as it seems
apply(df, 1, I)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    5    6    7    8    9    10
y    1    2    3    4    5    6    7    8    9    10
df2 <- data.frame(x = 1:10, y = 1:10)
apply(df2, 1, I)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    5    6    7    8    9    10
y    1    2    3    4    5    6    7    8    9    10
str(df)
'data.frame':   10 obs. of  2 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10
 $ y: Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2
# so that's weird. y isn't even a character vector !?!? Such are the strange beasts called factors.
# solution? or at least one strategy
apply(data.matrix(df), 1, I)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    5    6    7    8    9    10
y    1    3    4    5    6    7    8    9   10     2
Re: [R] Problem in 'Apply' function: does anybody have other solution
Thanks! I did not look at the output of str(df) closely. Since y is defined as a character variable when df is created (but stored as a factor), it looks like str(df) is sorting the factor levels, at least when it is displayed to the screen.

Jude

-----Original Message-----
From: David Winsemius [mailto:dwinsem...@comcast.net]
Sent: Thursday, June 18, 2009 11:22 AM
To: Ryan, Jude
Cc: r-help@r-project.org
Subject: Re: [R] Problem in 'Apply' function: does anybody have other solution

It's not a solution. Unfortunately data.matrix is no different with respect to factors than other functions. Note what str(df) produced for df$y.

-- David.
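A short illustration of what is happening in this thread (my own addition, not part of the exchange): factor levels built from character data are ordered as strings, so "10" sorts right after "1", and data.matrix()/as.numeric() return the underlying level codes rather than the original numbers. Converting through as.character() first recovers the intended values:

f <- factor(as.character(1:10))
levels(f)                     # "1" "10" "2" "3" ... - lexicographic order
as.numeric(f)                 # 1 3 4 5 6 7 8 9 10 2 - the level codes seen in the thread
as.numeric(as.character(f))   # 1 2 3 4 5 6 7 8 9 10 - the intended numeric values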
Re: [R] Inf in nnet final value for validation data
Andrea,

You can calculate predictions for your validation data based on nnet objects using the predict function (the predict function can also be used for regressions, quantile regressions, etc.). If you create a neural net with the following code:

library(nnet)
# 3 hidden neurons, for classification (linout = F), and not a skip-layer network (skip = F, or T if you want)
mynet.nn <- nnet(dependent_variable ~ ., data = train, size = 3, decay = 1e-3, linout = F, skip = F, maxit = 1000, Hess = T)
# calculate predictions for your training data and append to the data frame called train
train$predictions <- predict(mynet.nn)
# calculate predictions for your validation data and append to the data frame called valid
# you need to pass your neural net object and your validation dataset to the predict function
valid$predictions <- predict(mynet.nn, valid)

To just get the predictions for your validation dataset, this is all you need. I do not know why you need to calculate the log likelihood.

Hope this helps.

Jude

Andrea wrote:

Hi, I use nnet for my classification problem and have a problem concerning the calculation of the final value for my validation data (nnet only calculates the final value for the training data). I made my own final value formula (for the training data I get the same value as nnet):

# prob-matrix
pmatrix <- cat*fittedValues
tmp <- rowSums(pmatrix)
# -log likelihood
finalValue <- sum(-log(tmp))
# add penalty term
finalValue + sum(decay * weights^2)

where cat is a matrix with cols for each possible category and a row for each data record. The values are 1 for the target categories of a data record and 0 otherwise. My problem is that I get Inf values for some validation data records, because the row sum of cat*fittedValues gets 0 and the log gets Inf. Has anyone an idea how to deal with that problem properly? How does nnet? I'm thinking of a penalty value for those values. That means if cat*fittedValues == 0, not to calculate the log but add e.g. 100 instead of -log(tmp) to the finalValue sum?? But how to determine the penalty value???

I'm looking forward to all suggestions,

Andrea.
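On the Inf issue in Andrea's quoted question (my own suggestion, not part of the original reply): since -log(0) is infinite, one simple option is to clamp the row sums away from zero before taking the logs. The objects cat, fittedValues, decay and weights below are the ones defined in the quoted message:

eps <- .Machine$double.eps                       # smallest meaningful positive value
tmp <- pmax(rowSums(cat * fittedValues), eps)    # never exactly zero, so log() stays finite
finalValue <- sum(-log(tmp)) + sum(decay * weights^2)

This caps the contribution of any single record at roughly -log(.Machine$double.eps), about 36, which plays the same role as the fixed penalty value Andrea was considering.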
[R] Comparing R and SAS
Satish,

For a comparison of SAS and S, see the document "An Introduction to S and the Hmisc and Design Libraries" by Carlos Alzola and Frank E. Harrell. Frank Harrell is an expert in both SAS and R. You can download this document from http://www.r-project.org/ (click on Manuals, then Contributed Documentation). You can also look at the document written by Bob Muenchen at http://RforSASandSPSSusers.com (also a book published by Springer Verlag) for a comparison of SAS and R (and SPSS).

I have been using both SAS and R. While my primary expertise is mainly in SAS, I have been using R more and more relative to SAS as my familiarity with it grows. From my point of view, cutting-edge methodologies will always be implemented first in R (as you pointed out as well). SAS will follow several years later with some of these methodologies. Also, SAS has different products and users may not have all SAS products. Many firms have SAS/STAT but not other SAS products like SAS/ETS (econometrics and time series), SAS/Enterprise Miner or SAS/GRAPH. So in these situations R may be your only option. Even if you have these other SAS products, you can do things more rapidly in R, if you take the time to learn it well, than you can with SAS. I have SAS/Enterprise Miner but still prefer R for neural networks, splines, decision trees, etc., as I can program R to produce several neural networks, etc., using for loops. SAS/Enterprise Miner cannot be programmed. R graphs are definitely superior to SAS graphics, and can be programmed very easily. I also use R for EDA (exploratory data analysis) prior to building predictive models / data mining.

One area where SAS still excels is in processing huge files (over 30 GB in size - online data from vendors like DoubleClick with literally billions of records). But for statistical analysis you generally don't need to work with such large volumes of data. A much smaller random sample should suffice. If you have R running on Unix or Linux 64-bit operating systems (or Windows Vista?) and huge amounts of RAM, handling large datasets in R is less of an issue. Also, if your data resides on mainframes, SAS is probably your only choice if you cannot download the mainframe data to your PC. I use R on a 32-bit Windows operating system with 3 GB of RAM, and I have not had any problem doing statistical analysis / data mining with R on around 25,000 or so records with anywhere from 25 to 50 variables.

Hope this helps.

Jude

Satish wrote:

Hi: For those of you who are adept at both SAS and R, I have the following questions: a) What are some reasons / tasks for which you would use R over SAS and vice versa? b) What are some things for which R is a must-have that SAS cannot fulfill? I am on the ramp up on both of them. The general feeling that I am getting by following this group is that R updates to the product are at a much faster pace and therefore this would be better for someone who wants the bleeding edge (correct me if I am wrong). But I am also interested in what is inherently better in R that SAS cannot offer, perhaps because of the design. Thanks. Satish
[R] warning message when running quantile regression
Hi All,

I am running quantile regression in a for loop, starting with 1 variable and adding a variable at a time, reaching a maximum of 20 variables. I get the following warning messages after my for loop runs. Should I be concerned about these messages? I am building predictive models and am not interested in inference.

Warning messages:
1: In summary.rq(quantreg.emaff) : 3 non-positive fis   (I don't understand this message - is this a cause for concern?)
2: In summary.rq(quantreg.emaff) : 3 non-positive fis
3: In summary.rq(quantreg.emaff) : 5 non-positive fis
4: In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
5: In summary.rq(quantreg.emaff) : 6 non-positive fis
6: In summary.rq(quantreg.emaff) : 5 non-positive fis
7: In summary.rq(quantreg.emaff) : 5 non-positive fis
8: In summary.rq(quantreg.emaff) : 7 non-positive fis
9: In summary.rq(quantreg.emaff) : 10 non-positive fis
10: In summary.rq(quantreg.emaff) : 9 non-positive fis
11: In summary.rq(quantreg.emaff) : 8 non-positive fis
12: In summary.rq(quantreg.emaff) : 9 non-positive fis
13: In summary.rq(quantreg.emaff) : 8 non-positive fis
14: In summary.rq(quantreg.emaff) : 11 non-positive fis

I understand the non-unique solution message.

Thanks in advance,

Jude Ryan
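A side note of my own (not part of the original post): the "non-positive fis" warnings come from the local density (sparsity) estimates that summary.rq() uses for its "nid" standard errors; the coefficient estimates themselves are unaffected, so for pure prediction they are usually harmless, and a resampling-based covariance can be requested instead. A minimal sketch on a built-in dataset:

library(quantreg)
fit <- rq(stack.loss ~ ., data = stackloss, tau = 0.5)
summary(fit)                         # may use density-based ("nid") standard errors
summary(fit, se = "boot", R = 200)   # bootstrap covariance sidesteps the fis estimates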
Re: [R] Backpropagation to adjust weights in a neural net when receiving new training examples
You can figure out which weights go with which connections with the function summary(nnet.object) and nnet.object$wts. Sample code from Venables and Ripley is below:

# Neural Network model in Modern Applied Statistics with S, Venables and Ripley, pages 246 and 247
library(nnet)
attach(rock)
dim(rock)
[1] 48  4
area1 <- area/10000; peri1 <- peri/10000
rock1 <- data.frame(perm, area = area1, peri = peri1, shape)
dim(rock1)
[1] 48  4
head(rock1)
   perm   area     peri     shape
1   6.3 0.4990 0.279190 0.0903296
2   6.3 0.7002 0.389260 0.1486220
3   6.3 0.7558 0.393066 0.1833120
4   6.3 0.7352 0.386932 0.1170630
5  17.1 0.7943 0.394854 0.1224170
6  17.1 0.7979 0.401015 0.1670450
rock.nn <- nnet(log(perm) ~ area + peri + shape, rock1, size=3, decay=1e-3, linout=T, skip=T, maxit=1000, Hess=T)
# weights: 19
initial value 1196.787489
iter  10 value 32.400984
iter  20 value 31.664545
...
iter 280 value 14.230077
iter 290 value 14.229809
final value 14.229785
converged
summary(rock.nn)
a 3-3-1 network with 19 weights
options were - skip-layer connections  linear output units  decay=0.001
 b->h1 i1->h1 i2->h1 i3->h1
 -0.51  -9.33  14.59   3.85
 b->h2 i1->h2 i2->h2 i3->h2
  0.93   3.35   6.09  -5.86
 b->h3 i1->h3 i2->h3 i3->h3
  0.80 -10.93  -4.58   9.53
  b->o  h1->o  h2->o  h3->o  i1->o  i2->o  i3->o
  1.89 -14.62   7.35   8.77  -3.00  -4.25   4.44
sum((log(perm) - predict(rock.nn))^2)
[1] 13.20451
rock.nn$wts
 [1]  -0.5064848  -9.3288410  14.5859255   3.8521844   0.9266730   3.3524267   6.0900909  -5.8628448   0.8026366 -10.9345352  -4.5783516   9.5311123
[13]   1.8866734 -14.6181959   7.3466236   8.7655882  -2.9988287  -4.2508948   4.4397158

In the output from summary(rock.nn), b is the bias or intercept, h1 is the 1st hidden neuron, i1 is the first input (area) and o is the (linear) output. So b->h1 is the bias or intercept to the first hidden neuron, i1->h1 is the 1st input (area) to the first hidden neuron (there are 3 hidden neurons in this example), h1->o is the 1st hidden neuron to the output, and i1->o is the first input to the output (since skip=T - this is a skip-layer network). The weights printed under the labels (b->h1, ...) are rounded, but rock.nn$wts gives you the un-rounded weights. If you compare the output from summary(rock.nn) and rock.nn$wts you will see that the first row of weights from summary() is listed first in rock.nn$wts, followed by the 2nd row of weights from summary() and so on. You can construct the neural network equations manually (this is not in the Venables & Ripley book) and check the results against the predict() function to verify that the weights are listed in the order I described.
The code to do this is:

# manually calculate the neural network predictions based on the neural network equations
rock1$h1 <- -0.5064848 - 9.3288410 * rock1$area + 14.5859255 * rock1$peri + 3.8521844 * rock1$shape
rock1$logistic_h1 <- exp(rock1$h1) / (1 + exp(rock1$h1))
rock1$h2 <- 0.9266730 + 3.3524267 * rock1$area + 6.0900909 * rock1$peri - 5.8628448 * rock1$shape
rock1$logistic_h2 <- exp(rock1$h2) / (1 + exp(rock1$h2))
rock1$h3 <- 0.8026366 - 10.9345352 * rock1$area - 4.5783516 * rock1$peri + 9.5311123 * rock1$shape
rock1$logistic_h3 <- exp(rock1$h3) / (1 + exp(rock1$h3))
rock1$pred1 <- (1.8866734 - 14.6181959 * rock1$logistic_h1 + 7.3466236 * rock1$logistic_h2 + 8.7655882 * rock1$logistic_h3 - 2.9988287 * rock1$area - 4.2508948 * rock1$peri + 4.4397158 * rock1$shape)
rock1$nn.pred <- predict(rock.nn)
head(rock1)
   perm   area     peri     shape         h1 logistic_h1       h2 logistic_h2        h3  logistic_h3    pred1  nn.pred
1   6.3 0.4990 0.279190 0.0903296 -0.7413656   0.3227056 3.770238   0.9774726 -5.070985 0.0062370903 2.122910 2.122910
2   6.3 0.7002 0.389260 0.1486220 -0.7883026   0.3125333 4.773323   0.9916186 -7.219361 0.0007317343 1.514820 1.514820
3   6.3 0.7558 0.393066 0.1833120 -1.1178398   0.2464122 4.779515   0.9916699 -7.514112 0.0005450367 2.451231 2.451231
4   6.3 0.7352 0.386932 0.1170630 -1.2703391   0.2191992 5.061506   0.9937039 -7.892204 0.0003735057 2.656199 2.656199
5  17.1 0.7943 0.394854 0.1224170 -1.6854993   0.1563686 5.276490   0.9949156 -8.523675 0.0001986684 3.394902 3.394902
6  17.1 0.7979 0.401015 0.1670450 -1.4573040   0.100     5.064433   0.9937222 -8.165892 0.0002841023 3.072776 3.072776

The first 6 records show that the numbers from the manual equations and the predict() function are the same (the last 2 columns). As V&R point out in their book, there is a random starting point and several solutions, so if you run the same example your results may differ.

Hope this helps.

Jude Ryan

Filipe Rocha wrote:

I want to create a neural network, and then every time it receives new data, instead of creating a new nnet, I want to use a backpropagation algorithm to adjust the weights in the already created nn. I'm using the nnet package. I know that nn$wts gives the weights,
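As a compact alternative to typing the weights in by hand, here is a sketch of my own (assuming the 3-3-1 skip-layer fit above) that rebuilds the same predictions directly from rock.nn$wts with matrix algebra, which also avoids rounding error:

w <- rock.nn$wts
X <- as.matrix(rock1[, c("area", "peri", "shape")])   # inputs in formula order (i1, i2, i3)
H <- matrix(w[1:12], nrow = 4)             # one column per hidden unit: bias, then the 3 inputs
Z <- plogis(cbind(1, X) %*% H)             # logistic activations of the 3 hidden units
out <- w[13:19]                            # output bias, h1-h3 -> o, then the 3 skip weights
pred <- out[1] + Z %*% out[2:4] + X %*% out[5:7]
all.equal(as.vector(pred), as.vector(predict(rock.nn)))   # should be TRUE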
Re: [R] Backpropagation to adjust weights in a neural net when receiving new training examples
Not that I know of. If you do come across any, let me know, or better still, email r-help. Good luck with what you are trying to do.

Jude Ryan

From: Filipe Rocha [mailto:filipemaro...@gmail.com]
Sent: Friday, May 29, 2009 1:17 PM
To: Ryan, Jude
Cc: r-help@r-project.org
Subject: Re: [R] Backpropagation to adjust weights in a neural net when receiving new training examples

Thanks a lot for your answers. I can try to implement backpropagation myself with that information. But isn't there a function or method for backpropagation of error on new training examples, only to update the already created neural net? I want to implement reinforcement learning...

Thanks in advance

Filipe Rocha
Re: [R] Neural Network resource
The package AMORE appears to be more flexible, but I got very poor results using it when I tried to improve the predictive accuracy of a regression model. I don't understand all the options well enough to be able to fine tune it to get better predictions. However, using the nnet() function in package VR gave me decent results and is pretty easy to use (see the Venables and Ripley book, Modern Applied Statistics with S, pages 243 to 249, for more details). I tried using package neuralnet as well, but the neural net failed to converge. I could not figure out how to set the threshold option (or other options) to get the neural net to converge. I explored package neural as well. Of all these 4 packages, the nnet() function in package VR worked the best for me.

As another R user commented as well, you have too many hidden layers and too many neurons. In general you do not need more than 1 hidden layer. One hidden layer is sufficient for the universal approximator property of neural networks to hold true. As you keep adding neurons to the one hidden layer, the problem becomes more and more non-linear. If you add too many neurons you will overfit. In general, you do not need to add more than 10 neurons. The activation function in the hidden layer of Venables and Ripley's nnet() function is logistic, and you can specify the activation function in the output layer to be linear using linout = T in nnet().

Using one hidden layer, and starting with one hidden neuron and working up to 10 hidden neurons, I built several neural nets (4,000 records) and computed the training MSE. I also computed the validation MSE on a holdout sample of over 1,000 records. I also started with 2 variables and worked up to 15 variables in a for loop, so in all, I built 140 neural nets using 2 for loops, and stored the results in lists. I arranged my variables in the data frame based on correlations and partial correlations so that I could easily add variables in a for loop. This was my crude attempt to simulate variable selection since, from what I have seen, neural networks do not have variable selection methods. In my particular case, neural networks gave me marginally better results than regression. It all depends on the problem. If the data has non-linear patterns, neural networks will be better than linear regression.

My code is below. You can modify it to suit your needs if you find it useful. There are probably lines in the code that are redundant which can be deleted. HTH.
Jude Ryan

My code:

# set order in data frame train2 based on correlations and partial correlations
train2 <- train[, c(5,27,19,20,25,26,4,9,3,10,16,6,2,14,21,28)]
dim(train2)
names(train2)
library(nnet)
# skip = T
# train 10 neural networks in a loop and find the one with the minimum test and validation error
# create various lists to store the results of the neural network running in two for loops
# The Column List is for the outer for loop, which loops over variables
# The Row List is for the inner for loop, which loops over number of neurons in the hidden layer
col_nn <- list()  # stores the results of nnet() over variables - outer loop
row_nn <- list()  # stores the results of nnet() over neurons - inner loop
col_mse <- list()
# row_mse <- list()  # not needed because nn.mse is a data frame with rows
col_sum <- list()
row_sum <- list()
col_vars <- list()
row_vars <- list()
col_wts <- list()
row_wts <- list()
df_dim <- dim(train2)
df_dim[2]      # number of variables
df_dim[2] - 1
num_of_neurons <- 10
# build data frame to store results of neural net for each run
nn.mse <- data.frame(Train_MSE=seq(1:num_of_neurons), Valid_MSE=seq(1:num_of_neurons))
# open log file and redirect output to log file
sink("D:\\XXX\\YYY\\Programs\\Neural_Network_v8_VR_log.txt")
# outer loop - loop over variables
for (i in 3:df_dim[2]) {  # df_dim[2]
  # inner loop - loop over number of hidden neurons
  for (j in 1:num_of_neurons) {  # up to 10 neurons in the hidden layer
    # need to create a new data frame with just the predictor/input variables needed
    train3 <- train2[, c(1:i)]
    coreaff.nn <- nnet(dep_var ~ ., train3, size = j, decay = 1e-3, linout = T, skip = T, maxit = 1000, Hess = T)
    # row_vars[[j]] <- coreaff.nn$call  # not what we want
    # row_vars[[j]] <- names(train3)[c(2:i)]  # not needed in inner loop - same number of variables for all neurons
    row_sum[[j]] <- summary(coreaff.nn)
    row_wts[[j]] <- coreaff.nn$wts
    rownames(nn.mse)[j] <- paste("H", j, sep="")
    nn.mse[j, "Train_MSE"] <- mean((train3$dep_var - predict(coreaff.nn))^2)
    nn.mse[j, "Valid_MSE"] <- mean((valid$dep_var - predict(coreaff.nn, valid))^2)
  }
  col_vars[[i-2]] <- names(train3)[c(2:i)]
  col_sum[[i-2]] <- row_sum
  col_wts[[i-2]] <- row_wts
  col_mse[[i-2]] <- nn.mse
}
# cbind(col_vars[1], col_vars[2])
col_vars
col_sum
col_wts
sink()
cbind(col_mse[[1]], col_mse[[2]], col_mse[[3]], col_mse[[4]], col_mse[[5]], col_mse[[6]], col_mse[[7]],
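A small follow-up sketch of my own (assuming the col_mse list of data frames built by the loops above): once the runs finish, the best hidden-layer size for each variable set, and the overall winner, can be pulled out like this:

# for each variable set (list element), the hidden-layer size with the lowest validation MSE
best <- t(sapply(col_mse, function(m) c(size = which.min(m$Valid_MSE),
                                        valid_mse = min(m$Valid_MSE))))
best
best[which.min(best[, "valid_mse"]), ]   # overall best variable-set / size combination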
[R] mathematical model/equations for neural network in library(nnet)
Hi All,

I am trying to manually extract the scoring equations for a neural network so that I can score clients on a system that does not have R (a mainframe using COBOL). Using the example in Modern Applied Statistics with S (MASS), by Venables and Ripley (V&R), pages 246 and 247, I ran the following neural network. The code is the same as in V&R pages 246 and 247 except that I have skip = F. The equation will have 3 more terms if skip = T.

library(nnet)
attach(rock)
area1 <- area/10000; peri1 <- peri/10000
rock1 <- data.frame(perm, area = area1, peri = peri1, shape)
# skip = F
rock2.nn <- nnet(log(perm) ~ area + peri + shape, rock1, size=3, decay=1e-3, linout=T, skip=F, maxit=1000, Hess=T)
# weights: 16
initial value 1420.968942
iter  10 value 96.823665
iter  20 value 32.177295
iter  30 value 25.012430
iter  40 value 23.109650
iter  50 value 20.981236
iter  60 value 15.019016
iter  70 value 14.082190
iter  80 value 14.042717
iter  90 value 13.931124
iter 100 value 13.883691
iter 110 value 13.877307
iter 120 value 13.875051
iter 130 value 13.873667
final value 13.873634
converged
summary(rock2.nn)

The output from summary(rock2.nn) is:

a 3-3-1 network with 16 weights
options were - linear output units  decay=0.001
 b->h1 i1->h1 i2->h1 i3->h1
 10.65  -8.90 -14.63   6.17
 b->h2 i1->h2 i2->h2 i3->h2
 -0.72  11.76 -17.17  -1.56
 b->h3 i1->h3 i2->h3 i3->h3
  2.96  -9.03  -8.07  -2.54
  b->o  h1->o  h2->o  h3->o
 -6.91   2.45  11.53   9.22

Following the mathematical model / equations shown in V&R (pages 243 to 247) and another book on neural networks, I extracted the neural network equations manually, scored the dataset rock1, and compared the manual scores I obtained with the scores from predict(). They were totally different, and I am not sure what I am doing wrong. If anyone can give me some pointers I would appreciate it.
The mathematical model/equations I come up with from the weights are:

# manually calculate neural network predictions based on the neural network equations
rock1$h1 <- 10.65 - 8.9 * rock1$area - 14.63 * rock1$peri + 6.17 * rock1$shape
rock1$logistic_h1 <- exp(rock1$h1) / (1 + exp(rock1$h1))
rock1$h2 <- -0.72 + 11.76 * rock1$area - 11.17 * rock1$peri - 1.56 * rock1$shape
rock1$logistic_h2 <- exp(rock1$h2) / (1 + exp(rock1$h2))
rock1$h3 <- 2.96 - 9.03 * rock1$area - 8.07 * rock1$peri - 2.54 * rock1$shape
rock1$logistic_h3 <- exp(rock1$h3) / (1 + exp(rock1$h3))
# predictions based on manual scoring
rock1$pred_perm <- -6.91 + 2.45 * rock1$logistic_h1 + 11.53 * rock1$logistic_h2 + 9.22 * rock1$logistic_h3
# predictions using predict() and the object that has the output of the neural network
rock1$nn_pred_perm <- predict(rock2.nn)
rock1$log_perm <- log(rock1$perm)
head(rock1)
   perm   area     peri     shape         h1 logistic_h1       h2 logistic_h2        h3  logistic_h3 pred_perm nn_pred_perm log_perm
1   6.3 0.4990 0.279190 0.0903296  2.6816839   0.9359372 1.888774   0.8686156 -4.028470 0.0174901847  5.559444     1.920348 1.840550
2   6.3 0.7002 0.389260 0.1486220 -0.3596561   0.4110428 2.934467   0.9495242 -6.881634 0.0010254128  5.054524     1.546815 1.840550
3   6.3 0.7558 0.393066 0.1833120 -0.6961405   0.3326685 3.491694   0.9704505 -7.502529 0.0005513831  5.099416     2.630932 1.840550
4   6.3 0.7352 0.386932 0.1170630 -0.8318165   0.3032611 3.421303   0.9683637 -7.098737 0.0008254655  5.005834     2.489565 1.840550
5  17.1 0.7943 0.394854 0.1224170 -1.4406711   0.1914414 4.019478   0.9823546 -7.709940 0.0004481475  4.889712     3.235397 2.839078
6  17.1 0.7979 0.401015 0.1670450 -1.2874918   0.2162777 3.923376   0.9806092 -7.905522 0.0003685659  4.929703     3.078584 2.839078
sum((log(perm) - rock1$nn_pred_perm)^2)
[1] 12.55929
sum((log(perm) - rock1$pred_perm)^2)
[1] 82.63254

Thanks in advance,

Jude
[R] neural network not using all observations
I am exploring neural networks (adding non-linearities) to see if I can get more predictive power than a linear regression model I built. I am using the function nnet and following the example of Venables and Ripley, in Modern Applied Statistics with S, on pages 246 to 249. I have standardized variables (z-scores) such as assets, age and tenure. Other variables are binary (0 or 1): in max_acc_ownr_nwrth_n_med, for example, the variable has a value of 1 if the client's net worth is above the median net worth and 0 otherwise. These are derived variables I created, and variables that the regression algorithm found to be predictive. A regression on the same variables shown below gives me an R-Square of about 0.12. I am trying to increase the predictive power of this regression model with a neural network, while being careful to avoid overfitting. Similar to Venables and Ripley, I used the following code:

library(nnet)
dim(coreaff.trn.nn)
[1] 5088    8
head(coreaff.trn.nn)
  hh.iast.y WC_Total_Assets all_assets_per_hh         age      tenure max_acc_ownr_liq_asts_n_med max_acc_ownr_nwrth_n_med max_acc_ownr_ann_incm_n_med
1   3059448      -0.4692186        -0.4173532 -0.06599001 -1.04747935                           0                        1                           0
2   4899746       3.4854334         4.064     -0.06599001 -0.72540200                           1                        1                           1
3    727333      -0.2677357        -0.4177944 -0.30136473 -0.40332465                           1                        1                           1
4    443138      -0.5295170        -0.6999646 -0.1825     -1.04747935                           0                        0                           0
5    484253      -0.6112205        -0.7306664  0.64013414  0.07979137                           1                        0                           0
6    799054       0.6580506         1.1763114  0.24784295  0.07979137                           0                        1                           1

coreaff.nn1 <- nnet(hh.iast.y ~ WC_Total_Assets + all_assets_per_hh + age + tenure + max_acc_ownr_liq_asts_n_med +
                    max_acc_ownr_nwrth_n_med + max_acc_ownr_ann_incm_n_med, coreaff.trn.nn, size = 2, decay = 1e-3,
                    linout = T, skip = T, maxit = 1000, Hess = T)
# weights:  26
initial  value 12893652845419998.00
iter  10 value 6352515847944854.00
final  value 6287104424549762.00
converged

summary(coreaff.nn1)
a 7-2-1 network with 26 weights
options were - skip-layer connections  linear output units  decay=0.001
     b->h1     i1->h1     i2->h1     i3->h1     i4->h1     i5->h1     i6->h1     i7->h1
 -21604.84   -2675.80   -5001.90   -1240.16    -335.44  -12462.51  -13293.80   -9032.34
     b->h2     i1->h2     i2->h2     i3->h2     i4->h2     i5->h2     i6->h2     i7->h2
 210841.52   47296.92   58100.43  -13819.10   -9195.80  117088.99  131939.57  106994.47
      b->o      h1->o      h2->o      i1->o      i2->o      i3->o      i4->o      i5->o      i6->o      i7->o
1115190.67  894123.33 -417269.57   89621.84  170268.12   44833.63   59585.05  112405.30  437581.05  244201.69

sum((hh.iast.y - predict(coreaff.nn1))^2)
Error: object 'hh.iast.y' not found

So I try:

sum((coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1))^2)
Error: dims [product 5053] do not match the length of object [5088]
In addition: Warning message:
In coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1) :
  longer object length is not a multiple of shorter object length

Doing a little debugging:

pred <- predict(coreaff.nn1)
dim(pred)
[1] 5053    1
dim(coreaff.trn.nn)
[1] 5088    8

So pred has 5,053 rows while the input dataset has 5,088 records; it looks like the neural network is dropping 35 records. Does anyone have any idea why it would do this? It is most probably because those 35 records are bad data, a pretty common occurrence in the real world. Does anyone know how I can identify the dropped records? If I can identify them, I can reduce the input dataset to the same 5,053 records and then

sum((coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1))^2)

would work. A summary of my dataset is:

summary(coreaff.trn.nn)
   hh.iast.y        WC_Total_Assets      all_assets_per_hh         age                 tenure           max_acc_ownr_liq_asts_n_med
 Min.   :       0   Min.   :-6.970e-01   Min.   :-8.918e-01   Min.   :-4.617e+00   Min.   :-1.209e+00   Min.   :0.0000
 1st Qu.:  565520   1st Qu.:-5.387e-01   1st Qu.:-6.147e-01   1st Qu.:-4.583e-01   1st Qu.:-7.254e-01   1st Qu.:0.0000
 Median :  834164   Median :-3.160e-01   Median :-3.718e-01   Median : 9.093e-02   Median :-2.423e-01   Median :0.0000
 Mean   : 1060244   Mean   : 2.948e-13   Mean   : 3.204e-12   Mean   :-1.884e-11   Mean   :-3.302e-12   Mean   :0.4951
 3rd Qu.: 1207181   3rd Qu.: 1.127e-01   3rd Qu.: 1.891e-01   3rd Qu.: 5.617e-01   3rd Qu.: 5.629e-01   3rd Qu.:1.0000
 Max.   :45003160   Max.   :
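The usual cause of silently dropped rows in a formula-based fit like this is the default na.action (na.omit): any record with an NA in the response or in one of the seven predictors never reaches the optimizer. A quick check along those lines, using only base R (the figure 35 is just 5,088 minus 5,053):

sum(!complete.cases(coreaff.trn.nn))     # 35 here would confirm that NAs explain the dropped records
bad.rows <- which(!complete.cases(coreaff.trn.nn))
coreaff.trn.nn[bad.rows, ]               # inspect the incomplete records
colSums(is.na(coreaff.trn.nn))           # see which variables contribute the NAs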
[R] FW: neural network not using all observations
As a follow-up to my email below: the input data frame to nnet() has dimensions

dim(coreaff.trn.nn)
[1] 5088    8

and the predictions from the neural network (35 records are dropped - see the email below for more details) have dimensions

pred <- predict(coreaff.nn1)
dim(pred)
[1] 5053    1

So the following line of R code does not work, as the dimensions are different:

sum((coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1))^2)
Error: dims [product 5053] do not match the length of object [5088]
In addition: Warning message:
In coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1) :
  longer object length is not a multiple of shorter object length

Meanwhile:

dim(pred)
[1] 5053    1
tail(pred)
          [,1]
5083  664551.9
5084  552170.6
5085  684834.3
5086 1215282.5
5087 1116302.2
5088  658112.1

shows that the last row name of pred is 5088, which corresponds to the number of rows of coreaff.trn.nn, the input data frame to the neural network. I tried using row() to identify the 35 records that were dropped (or not scored). The code I tried was:

coreaff.trn.nn.subset <- coreaff.trn.nn[row(coreaff.trn.nn) == row(pred), ]
Error in row(coreaff.trn.nn) == row(pred) : non-conformable arrays

But I am not doing something right: the two objects have different dimensions (5,088 x 8 versus 5,053 x 1), so the comparison is non-conformable. Using cbind() I also bound a column of sequence numbers to pred, but that did not help. Basically, if I can identify the 5,053 records that the neural network made predictions for within the 5,088-record data frame (coreaff.trn.nn) used to train it, then I can compare the predictions to the actual values and compare the predictive power of the neural network to that of the linear regression model. Any idea how I can extract the 5,053 records that the neural network made predictions for from the data frame used to train it?

Thanks in advance,

Jude

From: Ryan, Jude
Sent: Tuesday, May 12, 2009 11:11 AM
To: 'r-help@r-project.org'
Cc: juderya...@yahoo.com
Subject: neural network not using all observations
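The tail(pred) output above suggests one route: the fitted values appear to keep the original row names of the rows that were actually used (which is what the default na.omit handling would do), so the two objects can be matched on row names rather than positions. A sketch under that assumption:

pred <- predict(coreaff.nn1)
used <- rownames(pred)                               # row names of the records that were scored
dropped <- setdiff(rownames(coreaff.trn.nn), used)   # the 35 records that were not scored
length(dropped)
coreaff.trn.nn[dropped, ]                            # inspect the dropped records
actual <- coreaff.trn.nn[used, "hh.iast.y"]          # actuals for the scored records, in the same order
sum((actual - pred)^2)                               # sum of squared errors for the neural network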
[R] How do I extract the scoring equations for neural networks and support vector machines?
Sorry for these multiple postings. I solved the problem by using na.omit() to drop records with missing values for the time being; I will worry about imputation, etc. later. I calculated the sum of squared errors for 3 models: linear regression, a neural network, and a support vector machine. This is the first run. Without doing any parameter tuning on the SVM, or playing around with the number of nodes in the hidden layer of the neural network, I found that the SVM had the lowest sum of squared errors, followed by the neural network, with the regression last. This probably indicates that the data has non-linear patterns.

I have a couple of questions.

1) Besides the sum of squared errors, are there any other metrics that can be used to compare these 3 models? AIC, BIC, etc. can be used for regressions, but I am not sure whether they can be used for SVMs and neural networks.

2) Is there an easy way to extract the scoring equations for SVMs and neural networks? Using the R objects I can always score new data manually, but the model will need to be implemented in a production environment. When the model gets implemented in production (possibly on the mainframe) I will need equations that can be coded in any language (COBOL or SAS on the mainframe). Also, getting the scoring equations for all 3 models would let me create an ensemble model where the predicted value is the average of the predictions from the SVM, the neural network and the linear regression. If the ensemble model has the smallest sum of squared errors, that is the model I would use. I have SAS Enterprise Miner as well and can get scoring code for the neural network (I don't have SVM), but the scoring code that SAS EM generates sucks, and I would much rather extract a scoring equation from R. I am using nnet() for the neural network.

Thanks in advance,

Jude Ryan
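On the first question, one route that works for all three fits is to compare generic error measures on data the models were not trained on, since AIC/BIC need a likelihood and do not transfer cleanly to SVMs. A minimal sketch; the objects y, pred.lm, pred.nn and pred.svm are placeholders for hold-out actuals and predictions, not objects from this thread:

# y, pred.lm, pred.nn, pred.svm: hold-out actuals and the three sets of predictions (placeholders)
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
mae  <- function(y, yhat) mean(abs(y - yhat))
r2   <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)   # out-of-sample R-squared

sapply(list(lm = pred.lm, nnet = pred.nn, svm = pred.svm),
       function(p) c(RMSE = rmse(y, p), MAE = mae(y, p), R2 = r2(y, p)))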
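On the second question, for the nnet() model the fitted object already carries everything the scoring equation needs, so one option is to dump the exact weights to a flat file and re-create the prediction formula on the target platform. A sketch only; the file name nnet_weights.csv is illustrative, and the weight names assume the "b->h1", "i1->h1", "i1->o" labels that coef() attaches to an nnet fit:

wts <- coef(coreaff.nn1)                      # exact weights as a named vector
write.csv(data.frame(weight = names(wts), value = unname(wts)),
          "nnet_weights.csv", row.names = FALSE)
# The COBOL/SAS side then reproduces, for each hidden unit j:
#   h_j = 1 / (1 + exp(-(b_j + sum_i w_ij * x_i)))
# and, because linout = T and skip = T were used, the prediction is:
#   yhat = b_o + sum_j w_jo * h_j + sum_i w_io * x_i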
[R] reading version 9 SAS datasets in R
Hi,

I am trying to read a SAS version 9.1.3 dataset into R (to preserve the SAS labels), but am unable to do so (I have already read in a CSV version). I first created a transport file using the SAS code:

libname ces2 'D:\CES Analysis\Data';
filename transp 'D:\CES Analysis\Data\fadata.xpt';
/* create a transport file - R cannot read a file created by proc cport */
proc cport data=ces2.fadata file=transp;
run;

I then tried to read it in R using:

library(foreign)
library(Hmisc)
fadata2 <- sasxport.get("D:\\CES Analysis\\Data\\fadata.xpt")
Error in lookup.xport(file) : file not in SAS transfer format

Next I tried using the libname statement and the xport engine to create a transport file. The problem with this method is that variable names cannot be more than 8 characters, as this method creates a SAS version 6 transport file.

libname to_r xport 'D:\CES Analysis\Data\fadata2.xpt';
data to_r.fadata2;
  set ces2.fadata;
run;

But I get an error message in the SAS log:

493  libname to_r xport 'D:\CES Analysis\Data\fadata2.xpt';
NOTE: Libref TO_R was successfully assigned as follows:
      Engine:        XPORT
      Physical Name: D:\CES Analysis\Data\fadata2.xpt
494
495  data to_r.fadata2;
496     set ces2.fadata;
497  run;
ERROR: The variable name BUS_TEL_N is illegal for the version 6 file TO_R.FADATA2.DATA.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set TO_R.FADATA2 was only partially opened and will not be saved.

Next I tried other ways of reading a SAS dataset into R, as shown below:

fadata2 <- sas.get("D:\\CES Analysis\\Data", mem = "fadata")
Error in sas.get("D:\\CES Analysis\\Data", mem = "fadata") :
  Unix file, D:\CES Analysis\Data/c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...).sd2
  D:\CES Analysis\Data/c(NA, 64716, NA, NA, NA, NA, NA, ..., NA, 64716, NA, NA, ...
In addition: Warning message:
In sas.get("D:\\CES Analysis\\Data", mem = "fadata") :
  D:\CES Analysis\Data/formats.sc? or formats.sas7bcat not found. Formatting ignored.

ls()
[1] "fadata"
?read.xport
fadata2 <- read.xport("D:\\CES Analysis\\Data\\fadata.xpt")
Error in lookup.xport(file) : file not in SAS transfer format
?read.ssd
fadata2 <- read.ssd("D:\\CES Analysis\\Data", "fadata")
SAS failed.  SAS program at D:\DOCUME~1\re06572\LOCALS~1\Temp\RtmpLqCVUx\file72ae2cd6.sas
The log file will be file72ae2cd6.log in the current directory
Warning messages:
1: In system(paste(sascmd, tmpProg)) : sas not found
2: In read.ssd("D:\\CES Analysis\\Data", "fadata") : SAS return code was -1
sashome <- "C:\\Program Files\\SAS\\SAS 9.1"
fadata2 <- read.ssd(file.path(sashome, "core", "sashelp"), "fadata", sascmd = file.path(sashome, "sas.exe"))
SAS failed.
SAS program at D:\DOCUME~1\re06572\LOCALS~1\Temp\RtmpLqCVUx\file6df11649.sas
The log file will be file6df11649.log in the current directory
Warning message:
In read.ssd(file.path(sashome, "core", "sashelp"), "fadata", sascmd = file.path(sashome, :
  SAS return code was 2

Is there any way I can read a SAS version 9 dataset into R so that I can preserve the SAS labels? If I have to shorten the SAS variable names to 8 characters or less to create a SAS version 6 transport file, I could probably do without the SAS labels, as I have already read the data into R from a CSV file.

Thanks in advance for any help.

Jude
___
Jude Ryan
Director, Client Analytic Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935  Fax 201-272-2914
Email: [EMAIL PROTECTED]
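One workaround that keeps the labels without fighting the version 6 name limit is to stay with the CSV export for the data and bring the labels over separately, then re-attach them in R with Hmisc's label(). This is only a sketch: the two file names are assumptions, and it presumes a small NAME/LABEL table has been written out on the SAS side (for example with PROC CONTENTS and an OUT= dataset exported to CSV).

library(Hmisc)

# fadata.csv: the data already exported from SAS
# fadata_labels.csv: two columns, NAME and LABEL, one row per variable (both file names are assumptions)
fadata <- read.csv("D:/CES Analysis/Data/fadata.csv", stringsAsFactors = FALSE)
labs   <- read.csv("D:/CES Analysis/Data/fadata_labels.csv", stringsAsFactors = FALSE)

# attach each SAS label to the matching column, where one exists
for (i in seq_len(nrow(labs))) {
  v <- labs$NAME[i]
  if (v %in% names(fadata)) label(fadata[[v]]) <- labs$LABEL[i]
}

label(fadata)   # check that the labels came across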