Re: [R] Forming Portfolios for Fama / French Regression

2010-01-12 Thread jude.ryan
Kai,

 

Your question is best addressed to r-sig-fina...@stat.math.ethz.ch, as
it is a finance-related question.

 

Jude

 

___
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.r...@ubs.com





Re: [R] Bhattacharyya distance metric

2009-11-09 Thread jude.ryan
The Bhattacharyya distance is different from the Mahalanobis distance.
See:

 

http://en.wikipedia.org/wiki/Bhattacharyya_distance

 

There are also the Hellinger distance and the Rao distance. For the Rao
distance, see:

 

http://www.scholarpedia.org/article/Fisher-Rao_metric
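
As a rough sketch, the normal-distribution form of the Bhattacharyya
distance can be computed directly from the formula on the Wikipedia page
above and compared with base R's mahalanobis(); the means and covariances
below are made up for illustration:

# Bhattacharyya distance between two multivariate normals:
# D_B = 1/8 * d' S^-1 d + 1/2 * log(det(S) / sqrt(det(S1) * det(S2))),
# where d = mu1 - mu2 and S = (S1 + S2) / 2
bhattacharyya <- function(mu1, mu2, S1, S2) {
  S <- (S1 + S2) / 2
  d <- mu1 - mu2
  as.numeric(t(d) %*% solve(S) %*% d) / 8 +
    0.5 * log(det(S) / sqrt(det(S1) * det(S2)))
}

mu1 <- c(0, 0); mu2 <- c(1, 1)     # made-up means
S1 <- diag(2);  S2 <- 2 * diag(2)  # made-up covariances
bhattacharyya(mu1, mu2, S1, S2)
# mahalanobis() returns the squared Mahalanobis distance; unlike the
# Bhattacharyya distance it uses a single covariance matrix, so it has
# no term for the covariance mismatch
mahalanobis(mu2, center = mu1, cov = S1)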

 

Jude

 

___
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.r...@ubs.com





Re: [R] compute differences

2009-09-23 Thread jude.ryan
Alessandro Carletti wrote:

 

Hi,

I have a problem.

I have a data frame looking like:

 

ID   val

A     .3
B    1.2
C    3.4
D    2.2
E    2.0

I need to CREATE the following TABLE:

CASE   DIFF

A-A     0
A-B    -0.9
A-C    -3.1
A-D    -1.9
A-E    -1.7
B-A    ...
B-B    ...
B-C
B-D
B-E
C-A
...

WHERE CASE IS THE COUPLE OF ELEMENTS CONSIDERED AND DIFF IS THE computed
DIFFERENCE between their values.

 

Could you give me suggestions?

 

Solution:

Besides the suggestions given by others, you can use the sqldf package
to do this (leveraging your knowledge of SQL, if you know SQL). If you
join your data frame with itself, without a join condition, you will get
the Cartesian product of the two data frames, which seems to be exactly
what you need. A warning is in order: generally, when you join 2 (or
more) data frames you DO NOT want the Cartesian product but want to join
the data frames by some key. The solution to your particular problem,
however, can be implemented easily using the Cartesian product.

 

mydata <- data.frame(id=rep(c('A','B','C','D','E'), each=2),
                     val=sample(1:5, 10, replace=T))

mydata

library(sqldf)

# merge data frame with itself to create a Cartesian product - this is
# normally NOT what you want

# Note: 'case' is a key word in SQL, so I use cases for the variable name.
# Likewise, diff is a function name in R, so I use diffr.

mydata2 <- sqldf("select a.id as id1, a.val as val1, b.id as id2, b.val
                  as val2, a.id || ' - ' || b.id as cases,
                  a.val - b.val as diffr from mydata a, mydata b")

dim(mydata2)  # check dimensions of the merged dataset
head(mydata2) # examine the first 6 records

# if you want only the columns cases and diffr, then use this SQL code

mydata3 <- sqldf("select a.id || ' - ' || b.id as cases, a.val - b.val
                  as diffr from mydata a, mydata b")

dim(mydata3)  # check dimensions of the merged dataset
head(mydata3) # examine the first 6 records
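
For completeness, the same Cartesian product can be had in base R (a
sketch of mine, not required for the sqldf approach): merge() with
by = NULL also cross-joins a data frame with itself.

# merge with by = NULL returns the Cartesian product; the duplicated
# column names get .x and .y suffixes
cart <- merge(mydata, mydata, by = NULL)
mydata4 <- data.frame(cases = paste(cart$id.x, cart$id.y, sep = " - "),
                      diffr = cart$val.x - cart$val.y)
head(mydata4)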

 

Hope this helps.

 

Jude

___
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.r...@ubs.com





Re: [R] compute differences

2009-09-23 Thread jude.ryan
Thanks, Petr! It is good to see multiple solutions to the same problem.
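
Petr's outer() approach below can be checked quickly against the data
from the original question (a sketch of mine; 'test' is just the example
data frame rebuilt by hand):

test <- data.frame(ID = c("A","B","C","D","E"),
                   val = c(0.3, 1.2, 3.4, 2.2, 2.0))
DIFF <- as.vector(t(outer(test$val, test$val, "-")))
CASE <- as.vector(t(outer(test$ID, test$ID, paste, sep="-")))
head(data.frame(CASE, DIFF))  # A-A 0, A-B -0.9, A-C -3.1, ...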

Best,

Jude

-Original Message-
From: Petr PIKAL [mailto:petr.pi...@precheza.cz] 
Sent: Wednesday, September 23, 2009 10:59 AM
To: Ryan, Jude
Cc: alxmil...@yahoo.it; r-help@r-project.org
Subject: Re: [R] compute differences

Hi

You can use outer. If your data are in data frame test then

DIFF <- as.vector(t(outer(test$val, test$val, "-")))

returns a vector; you just need to add suitable names to the rows.

CASE <- as.vector(t(outer(test$ID, test$ID, paste, sep="-")))

data.frame(CASE, DIFF)

will put it together.

Regards
Petr



Re: [R] Recursive partitioning algorithms in R vs. alia

2009-06-23 Thread jude.ryan
Thanks for your point of view, Terry! It is always fascinating to follow
the history of the field, especially as told by someone involved with
it.

Jude Ryan

-Original Message-
From: Terry Therneau [mailto:thern...@mayo.edu] 
Sent: Tuesday, June 23, 2009 9:22 AM
To: Ryan, Jude; c...@datanalytics.com
Cc: r-help@r-project.org
Subject: Re: [R] Recursive partitioning algorithms in R vs. alia

A point of history:

   Both the commercial CART program and the rpart() function are based on
the book Classification and Regression Trees (Breiman, Friedman, Olshen,
Stone, 1984).  As a reader/commentator on one of the early drafts I got to
know the material well.  CART started as a large Fortran program written
by Jerry Friedman which was the testing ground for the ideas in the book.
I had the code at one time and made some modifications to it, but found it
too frustrating to go very far with.  Fortran is just too clumsy for a
recursive task, and Jerry's ability to hold umpteen variables in his head
at once is greater than mine -- the Fortran was a large monolithic block.
Salford Systems acquired rights to that code; I don't know whether any of
the original lines remain in their product.  I had lots of conversations
with their main programmer (15-20 years ago now) about methods for
speeding it up; mainly an interesting problem in optimal indexing.

   When rpart was first written its output agreed with CART almost
entirely.  The only major difference was in surrogates: I pick the
surrogate with the largest number of agreements; CART picked the one with
the greatest % agreement.  This means that rpart favors variables with
fewer missing values.  Since that point in time both codes have evolved.
I haven't had time to do important work on rpart in over a decade.  It's
not surprising that the graphics and display are behind the curve; what's
more surprising is that it still endures.

   Rpart is called rpart because the authors copyrighted the term "CART"
for their program.  It was the best alternative name that I could come up
with at the time.  I find it amusing that one consequence of their
copyright choice is that I now see "recursive partitioning" far more
often than "CART" as the generic label for tree-based methods.

   Terry T
   



Re: [R] Recursive partitioning algorithms in R vs. alia

2009-06-22 Thread jude.ryan
I have used all 3 packages for decision trees (SAS/EM, CART and R). As
another user on the list commented, the algorithms CART uses are
proprietary. I also know that since the algorithms are proprietary, the
decision tree that you get from SAS is based on a slightly different
algorithm, so as to not violate copyright laws. When I first started
using R (rpart) I benchmarked it (in terms of results obtained) for my
particular problem at the time against Salford Systems' CART. R gave me
an identical tree, with the splitting value differing in the 2nd or 3rd
decimal place, from what I recall. I did not have SAS/EM at that
particular company and so could not benchmark it. Salford Systems' CART
does have additional types of splitting criteria, such as twoing, but
again, these may be of value in certain types of problems. The splitting
criteria found in R are good enough.

 

I do have SAS/EM right now but prefer R to SAS/EM since R can be
programmed and SAS/EM cannot. This may not be relevant for decision
trees, but for neural networks, for example, if I want to build hundreds
of neural networks (since there are no variable selection methods for
neural networks) with different predictors and different numbers of
neurons, I can do this easily in R but cannot do this in SAS/EM. SAS/EM
does have a variable selection node, but that is independent of the
neural network node, so, from what I understand, you have to select the
variables and then pass them to the neural network node.
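
As a rough sketch of what I mean by programming R in a loop (my
illustration on the rock data that ships with R, not my actual client
code):

library(nnet)
data(rock)
rock1 <- transform(rock, area = area/10000, peri = peri/10000)
# fit one network per hidden-layer size and compare training MSE
fits <- lapply(1:5, function(s)
  nnet(log(perm) ~ area + peri + shape, data = rock1, size = s,
       decay = 1e-3, linout = TRUE, maxit = 1000, trace = FALSE))
sapply(fits, function(f) mean((log(rock1$perm) - predict(f))^2))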

 

In general, you get prettier output with CART and SAS/EM for trees.
However, there are packages in R that can give you prettier output than
rpart does. One GUI that you may want to explore, which works with R, is
Rattle. It builds trees, neural networks, boosting, etc., and you can
see the generated R code as well.

 

In terms of handling large volumes of data, SAS/EM is probably the best.
However, if you have a 64-bit operating system with lots of RAM, and use
random sampling, R should suffice. It is debatable whether the extra
features like pretty output and variable importance are worth the huge
costs you have to pay for those products, unless you really need these
features. With R you can do what you want, and that is build a good
tree. From what I have read, variable importance measures can be biased,
as they are affected by factors such as multicollinearity, variables
with many categories, etc., so their usefulness is questionable
(however, end-users may love them).

 

SAS/EM is by far the most expensive product, and Salford Systems' CART
is pretty expensive as well. So depending on your needs, R may be good
enough or the best, because you can program it, and the latest
methodologies will always be implemented in R first. For comparisons of
the programming capabilities of SAS (macros) versus R, you may want to
look at what Frank Harrell and Terry Therneau (who wrote rpart) have to
say. Both are experts in SAS and R.

 

Hope this helps.

 

Jude

 

 

Carlos wrote:

 

Dear R-helpers,

 

I had a conversation with a guy working in a business intelligence

department at a major Spanish bank. They rely on recursive partitioning

methods to rank customers according to certain criteria. 

 

They use both SAS EM and Salford Systems' CART. I have used the rpart
package in the past, but I could not provide any kind of feature
comparison or the like, as I have no access to any installation of the
first two proprietary products.

 

Has anybody experience with them? Is there any public benchmark

available? Is there any very good --although solely technical-- reason

to pay hefty software licences? How would the algorithms implemented in

rpart compare to those in SAS and/or CART?

 

Best regards,

 

Carlos J. Gil Bellosta

http://www.datanalytics.com/

 

 

___
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.r...@ubs.com




Re: [R] Problem in 'Apply' function: does anybody have other solution

2009-06-18 Thread jude.ryan
David Winsemius' solution:

 

apply(data.matrix(df), 1, I)

  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    5    6    7    8    9    10
y    1    3    4    5    6    7    8    9   10     2

 

For y and [,2] above the value is 3. Why is the value not 2?
It looks like the value is 2 for y and [,10] (this should be 10, right?),
and values 3 to 10 are shifted one position to the left for y.

 

I got the same results when I ran this code.

 

Thanks,

 

Jude

 

David Winsemius wrote:

 

On Jun 17, 2009, at 9:27 AM, jim holtman wrote:

 

Do an 'str' of your object. It looks like one of the columns is probably
character/factor since there are quotes around the 'numbers'. You can
also explicitly convert the offending columns to numeric if you want to.
Also use colClasses on the read.csv to define the class of the data in
each column. This will show you where the error is.

 

One function that might be of use is data.matrix, which will attempt to
convert character vectors to numeric vectors across an entire dataframe.
I hope this is not beating a dead horse, but see if these examples are
helpful in any way:

 

?data.matrix
df <- data.frame(x=1:10, y=as.character(1:10))
df
    x  y
1   1  1
2   2  2
3   3  3
4   4  4
5   5  5
6   6  6
7   7  7
8   8  8
9   9  9
10 10 10
# not all is as it seems
apply(df, 1, I)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x " 1" " 2" " 3" " 4" " 5" " 6" " 7" " 8" " 9" "10"
y "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
df2 <- data.frame(x=1:10, y=1:10)
apply(df2, 1, I)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    5    6    7    8    9    10
y    1    2    3    4    5    6    7    8    9    10
str(df)
'data.frame':   10 obs. of  2 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10
 $ y: Factor w/ 10 levels "1","10","2","3",..: 1 3 4 5 6 7 8 9 10 2

# so that's weird. y isn't even a character vector !?!? Such are the
# strange beasts called factors.

# solution? or at least one strategy

apply(data.matrix(df), 1, I)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    5    6    7    8    9    10
y    1    3    4    5    6    7    8    9   10     2

 

 

 

 

___
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.r...@ubs.com





Re: [R] Problem in 'Apply' function: does anybody have other solution

2009-06-18 Thread jude.ryan
Thanks! I did not look at the output of str(df) closely. Since y is
defined as a character variable when df is created (but stored as a
factor), the levels are sorted as character strings when the factor is
created, which is what str(df) displays.
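
A small sketch of the distinction (my illustration; with the old
stringsAsFactors default, the character column becomes a factor):

df <- data.frame(x = 1:10, y = as.character(1:10),
                 stringsAsFactors = TRUE)
as.numeric(df$y)                # 1 3 4 5 6 7 8 9 10 2 - the level codes
as.numeric(as.character(df$y))  # 1 2 3 4 5 6 7 8 9 10 - the displayed values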

Jude

-Original Message-
From: David Winsemius [mailto:dwinsem...@comcast.net] 
Sent: Thursday, June 18, 2009 11:22 AM
To: Ryan, Jude
Cc: r-help@r-project.org
Subject: Re: [R] Problem in 'Apply' function: does anybody have other
solution

It's not a solution. Unfortunately data.matrix is no different with  
respect to factors than other functions. Note what str(df) produced  
for df$y.

-- 
David.

On Jun 18, 2009, at 10:59 AM, jude.r...@ubs.com
wrote:


David Winsemius, MD
Heritage Laboratories
West Hartford, CT


Re: [R] Inf in nnet final value for validation data

2009-06-11 Thread jude.ryan
Andrea,

 

You can calculate predictions for your validation data based on nnet objects
using the predict function (the predict function can also be used for
regressions, quantile regressions, etc.).

If you create a neural net with the following code: 

 

library(nnet)

# 3 hidden neurons, for classification (linout = F), and not a skip-layer
# network (skip = F, or T if you want)
mynet.nn <- nnet(dependent_variable ~ ., data = train, size = 3,
                 decay = 1e-3, linout = F, skip = F, maxit = 1000, Hess = T)

# calculate predictions for your training data and append to data frame
# called train
train$predictions <- predict(mynet.nn)

# calculate predictions for your validation data and append to data frame
# called valid; you need to pass your neural net object and your
# validation dataset to the predict function
valid$predictions <- predict(mynet.nn, valid)

 

To just get the predictions for your validation dataset this is all you need. I 
do not know why you need to calculate the log likelihood.
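
On the Inf itself, one common guard (my suggestion; I do not know what
nnet does internally) is to clamp the row sums away from zero before
taking logs:

tmp <- c(0.9, 0.2, 0)        # toy row sums; the 0 is what produces Inf
eps <- .Machine$double.eps
sum(-log(pmax(tmp, eps)))    # finite instead of Inf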

 

Hope this helps.

 

Jude

 

 

Andrea wrote:

 

Hi,

 

I use nnet for my classification problem and have a problem concerning the
calculation of the final value for my validation data (nnet only calculates
the final value for the training data). I made my own final-value formula
(for the training data I get the same value as nnet):

  

# prob-matrix
pmatrix <- cat*fittedValues
tmp <- rowSums(pmatrix)

# -log likelihood
finalValue <- sum(-log(tmp))

# add penalty term
finalValue + sum(decay * weights^2)

  

where cat is a matrix with columns for each possible category and a row for
each data record. The values are 1 for the target categories of a data
record and 0 otherwise.

My problem is that I get Inf values for some validation data records,
because the row sum of cat*fittedValues gets 0 and the log gets Inf.

Has anyone an idea how to deal with that problem properly? How does nnet
handle it?

I'm thinking of a penalty value for those cases. That means, if
cat*fittedValues == 0, not to calculate the log but to add e.g. 100
instead of -log(tmp) to the finalValue sum?

But how to determine the penalty value?

I'm looking forward to all suggestions,

Andrea.

 

 

___
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.r...@ubs.com





[R] Comparing R and SAS

2009-06-10 Thread jude.ryan
Satish,

 

For a comparison of SAS and S, see the document An Introduction to S
and the Hmisc and Design Libraries by Carlos Alzola and Frank E.
Harrell. Frank Harrell is an expert in both SAS and R. You can download
this document from http://www.r-project.org/: click on manuals, and
then contributed documentation. You can also look at the document
written by Bob Muenchen at http://RforSASandSPSSusers.com (also a book
published by Springer Verlag) for a comparison of SAS and R (and SPSS).

 

I have been using both SAS and R. While my primary expertise is mainly
in SAS, I have been using R more and more relative to SAS as my
familiarity with it grows. From my point of view, cutting-edge
methodologies will always be implemented first in R (as you pointed out
as well). SAS will follow several years later with some of these
methodologies. Also, SAS has different products, and users may not have
all SAS products. Many firms have SAS/STAT but not other SAS products
like SAS/ETS (economic time series), SAS/Enterprise Miner or SAS/GRAPH.
So in these situations R may be your only option. Even if you have these
other SAS products, you can do things more rapidly in R, if you take the
time to learn it well, than you can with SAS. I have SAS/Enterprise
Miner but still prefer R for neural networks, splines, decision trees,
etc., as I can program R to produce several neural networks, etc. using
for loops. SAS/Enterprise Miner cannot be programmed. R graphs are
definitely superior to SAS graphics, and can be programmed very easily.
I also use R for EDA (exploratory data analysis) prior to building
predictive models/data mining.

 

One area where SAS still excels is in processing huge files (over 30 GB
in size - online data from vendors like DoubleClick with literally
billions of records). But for statistical analysis you generally don't
need to work with such large volumes of data; a much smaller random
sample should suffice. If you have R running on a 64-bit Unix or Linux
operating system (or Windows Vista?) with huge amounts of RAM, handling
large datasets in R is less of an issue. Also, if your data resides on
mainframes, SAS is probably your only choice if you cannot download the
mainframe data to your PC. I use R on a 32-bit Windows operating system
with 3 GB of RAM, and I have not had any problem doing statistical
analysis/data mining with R on around 25,000 or so records with anywhere
from 25 to 50 variables.

 

Hope this helps.

 

Jude

 

Satish wrote:

 

Hi:

For those of you who are adept at both SAS and R, I have the following
questions:

 

a) What are some reasons / tasks for which you would use R over SAS and
vice versa?

b) What are some things for which R is a must have that SAS cannot
fulfill the requirements?

 

I am on the ramp up on both of them. The general feeling that I am
getting by following this group is that R updates to the product are at
a much faster pace and therefore, this would be better for someone who
wants the bleeding edge (correct me if I am wrong). But I am also
interested in what is inherently better in R that SAS cannot offer
perhaps because of the design.

 

Thanks.

Satish

 

 

___
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.r...@ubs.com





[R] warning message when running quantile regression

2009-05-31 Thread jude.ryan
Hi All,

 

I am running quantile regression in a for loop, starting with 1
variable and adding a variable at a time, reaching a maximum of 20
variables.

I get the following warning messages after my for loop runs. Should I
be concerned about these messages? I am building predictive models and
am not interested in inference.

 

Warning messages:

1: In summary.rq(quantreg.emaff) : 3 non-positive fis   - I don't
understand this message - is this a cause for concern?

2: In summary.rq(quantreg.emaff) : 3 non-positive fis

3: In summary.rq(quantreg.emaff) : 5 non-positive fis

4: In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique

5: In summary.rq(quantreg.emaff) : 6 non-positive fis

6: In summary.rq(quantreg.emaff) : 5 non-positive fis

7: In summary.rq(quantreg.emaff) : 5 non-positive fis

8: In summary.rq(quantreg.emaff) : 7 non-positive fis

9: In summary.rq(quantreg.emaff) : 10 non-positive fis

10: In summary.rq(quantreg.emaff) : 9 non-positive fis

11: In summary.rq(quantreg.emaff) : 8 non-positive fis

12: In summary.rq(quantreg.emaff) : 9 non-positive fis

13: In summary.rq(quantreg.emaff) : 8 non-positive fis

14: In summary.rq(quantreg.emaff) : 11 non-positive fis

 

I understand the non-unique solution message.
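
In case it helps, a minimal sketch of the kind of call I am making (my
illustration on the engel data that ships with quantreg, not my actual
model):

library(quantreg)
data(engel)
fit <- rq(foodexp ~ income, tau = 0.5, data = engel)
# the default se = "nid" estimates densities (the "fis") and warns when
# some estimates are non-positive
summary(fit)
# bootstrap standard errors avoid the density estimates altogether
summary(fit, se = "boot")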

 

Thanks in advance,

 

Jude Ryan

 

___
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.r...@ubs.com





Re: [R] Backpropagation to adjust weights in a neural net when receiving new training examples

2009-05-29 Thread jude.ryan
You can figure out which weights go with which connections with the
function summary(nnet.object) and nnet.object$wts. Sample code from
Venables and Ripley is below:

 

# Neural Network model in Modern Applied Statistics with S, Venables and
# Ripley, pages 246 and 247

library(nnet)
attach(rock)
dim(rock)
[1] 48  4
area1 <- area/10000; peri1 <- peri/10000
rock1 <- data.frame(perm, area = area1, peri = peri1, shape)
dim(rock1)
[1] 48  4
head(rock1)
   perm   area     peri     shape
1   6.3 0.4990 0.279190 0.0903296
2   6.3 0.7002 0.389260 0.1486220
3   6.3 0.7558 0.393066 0.1833120
4   6.3 0.7352 0.386932 0.1170630
5  17.1 0.7943 0.394854 0.1224170
6  17.1 0.7979 0.401015 0.1670450
rock.nn <- nnet(log(perm) ~ area + peri + shape, rock1, size=3,
                decay=1e-3, linout=T, skip=T, maxit=1000, Hess=T)
# weights:  19
initial  value 1196.787489
iter  10 value 32.400984
iter  20 value 31.664545
...
iter 280 value 14.230077
iter 290 value 14.229809
final  value 14.229785
converged
summary(rock.nn)
a 3-3-1 network with 19 weights
options were - skip-layer connections  linear output units  decay=0.001
 b->h1 i1->h1 i2->h1 i3->h1
 -0.51  -9.33  14.59   3.85
 b->h2 i1->h2 i2->h2 i3->h2
  0.93   3.35   6.09  -5.86
 b->h3 i1->h3 i2->h3 i3->h3
  0.80 -10.93  -4.58   9.53
  b->o  h1->o  h2->o  h3->o  i1->o  i2->o  i3->o
  1.89 -14.62   7.35   8.77  -3.00  -4.25   4.44
sum((log(perm) - predict(rock.nn))^2)
[1] 13.20451
rock.nn$wts
 [1]  -0.5064848  -9.3288410  14.5859255   3.8521844   0.9266730
      3.3524267   6.0900909  -5.8628448   0.8026366 -10.9345352
     -4.5783516   9.5311123
[13]   1.8866734 -14.6181959   7.3466236   8.7655882  -2.9988287
     -4.2508948   4.4397158

 

 

In the output from summary(rock.nn), b is the bias or intercept, h1 is
the 1st hidden neuron, i1 is the first input (area) and o is the
(linear) output. So b->h1 is the bias or intercept to the first hidden
neuron, i1->h1 is the 1st input (area) to the first hidden neuron (there
are 3 hidden neurons in this example), h1->o is the 1st hidden neuron to
the first output, and i1->o is the first input to the output (since
skip=T - this is a skip-layer network). The weights are below (b->h1 ..)
but are rounded. But rock.nn$wts gives you the un-rounded weights. If
you compare the output from summary(rock.nn) and rock.nn$wts you will
see that the first row of weights from summary() is listed first in
rock.nn$wts, followed by the 2nd row of weights from summary() and so
on.

You can construct the neural network equations manually (this is not in
the Venables & Ripley book) and check the results against the predict()
function to verify that the weights are listed in the order I described.
The code to do this is:

 

# manually calculate the neural network predictions based on the neural
# network equations

rock1$h1 <- -0.5064848 - 9.3288410 * rock1$area + 14.5859255 * rock1$peri +
            3.8521844 * rock1$shape
rock1$logistic_h1 <- exp(rock1$h1) / (1 + exp(rock1$h1))
rock1$h2 <- 0.9266730 + 3.3524267 * rock1$area + 6.0900909 * rock1$peri -
            5.8628448 * rock1$shape
rock1$logistic_h2 <- exp(rock1$h2) / (1 + exp(rock1$h2))
rock1$h3 <- 0.8026366 - 10.9345352 * rock1$area - 4.5783516 * rock1$peri +
            9.5311123 * rock1$shape
rock1$logistic_h3 <- exp(rock1$h3) / (1 + exp(rock1$h3))
rock1$pred1 <- (1.8866734 - 14.6181959 * rock1$logistic_h1 + 7.3466236 *
                rock1$logistic_h2 + 8.7655882 * rock1$logistic_h3 -
                2.9988287 * rock1$area - 4.2508948 * rock1$peri +
                4.4397158 * rock1$shape)
rock1$nn.pred <- predict(rock.nn)

head(rock1)

 

   perm   area     peri     shape         h1 logistic_h1       h2
1   6.3 0.4990 0.279190 0.0903296 -0.7413656   0.3227056 3.770238
2   6.3 0.7002 0.389260 0.1486220 -0.7883026   0.3125333 4.773323
3   6.3 0.7558 0.393066 0.1833120 -1.1178398   0.2464122 4.779515
4   6.3 0.7352 0.386932 0.1170630 -1.2703391   0.2191992 5.061506
5  17.1 0.7943 0.394854 0.1224170 -1.6854993   0.1563686 5.276490
6  17.1 0.7979 0.401015 0.1670450 -1.4573040   0.1888806 5.064433

  logistic_h2        h3  logistic_h3    pred1  nn.pred
1   0.9774726 -5.070985 0.0062370903 2.122910 2.122910
2   0.9916186 -7.219361 0.0007317343 1.514820 1.514820
3   0.9916699 -7.514112 0.0005450367 2.451231 2.451231
4   0.9937039 -7.892204 0.0003735057 2.656199 2.656199
5   0.9949156 -8.523675 0.0001986684 3.394902 3.394902
6   0.9937222 -8.165892 0.0002841023 3.072776 3.072776

 

The first 6 records show that the numbers from the manual equations and
the predict() function are the same (the last 2 columns). As VR point
out in their book, there are several solutions and a random starting
point, and if you run the same example your results may differ.

 

Hope this helps.

 

Jude Ryan

 

Filipe Rocha wrote:

 

I want to create a neural network, and then every time it receives new
data, instead of creating a new nnet, I want to use a backpropagation
algorithm to adjust the weights in the already created NN.

I'm using the nnet package; I know that nn$wts gives the weights,

Re: [R] Backpropagation to adjust weights in a neural net when receiving new training examples

2009-05-29 Thread jude.ryan
Not that I know of.

If you do come across any, let me know, or better still, email r-help.

Good luck with what you are trying to do.

 

Jude Ryan

 



From: Filipe Rocha [mailto:filipemaro...@gmail.com] 
Sent: Friday, May 29, 2009 1:17 PM
To: Ryan, Jude
Cc: r-help@r-project.org
Subject: Re: [R] Backpropagation to adjust weights in a neural net when
receiving new training examples



Thanks a lot for your answers. I can try to implement backpropagation
myself with that information.

But isn't there a function or method for backpropagation of error on new
training examples only, to update the already created neural net?

I want to implement reinforcement learning...

Thanks in advance



Filipe Rocha


Re: [R] Neural Network resource

2009-05-28 Thread jude.ryan
The package AMORE appears to be more flexible, but I got very poor
results using it when I tried to improve the predictive accuracy of a
regression model. I don't understand all the options well enough to be
able to fine-tune it to get better predictions. However, using the
nnet() function in package VR gave me decent results and is pretty easy
to use (see the Venables and Ripley book, Modern Applied Statistics with
S, pages 243 to 249, for more details). I tried using package neuralnet
as well, but the neural net failed to converge; I could not figure out
how to set the threshold option (or other options) to get the neural net
to converge. I explored package neural as well. Of all these 4 packages,
the nnet() function in package VR worked the best for me.

 

As another R user commented as well, you have too many hidden layers and
too many neurons. In general you do not need more than 1 hidden layer.
One hidden layer is sufficient for the universal approximator property
of neural networks to hold true. As you keep adding neurons to the one
hidden layer, the problem becomes more and more non-linear. If you add
too many neurons you will overfit. In general, you do not need to add
more than 10 neurons. The activation function in the hidden layer of
Venables and Ripley's nnet() function is logistic, and you can specify
the activation function in the output layer to be linear using linout =
T in nnet(). Using one hidden layer, and starting with one hidden neuron
and working up to 10 hidden neurons, I built several neural nets (4,000
records) and computed the training MSE. I also computed the validation
MSE on a holdout sample of over 1,000 records. I also started with 2
variables and worked up to 15 variables in a for loop, so in all, I
built 140 neural nets using 2 for loops, and stored the results in
lists. I arranged my variables in the data frame based on correlations
and partial correlations so that I could easily add variables in a for
loop. This was my crude attempt to simulate variable selection since,
from what I have seen, neural networks do not have variable selection
methods. In my particular case, neural networks gave me marginally
better results than regression. It all depends on the problem. If the
data has non-linear patterns, neural networks will be better than linear
regression.

 

My code is below. You can modify it to suit your needs if you find it
useful. There are probably lines in the code that are redundant which
can be deleted.

 

HTH.

 

Jude Ryan

 

My code:

 

# set order in data frame train2 based on correlations and partial
# correlations
train2 <- train[, c(5,27,19,20,25,26,4,9,3,10,16,6,2,14,21,28)]
dim(train2)
names(train2)

library(nnet)

# skip = T
# train 10 neural networks in a loop and find the one with the minimum
# test and validation error
# create various lists to store the results of the neural network
# running in two for loops
# The Column List is for the outer for loop, which loops over variables
# The Row List is for the inner for loop, which loops over number of
# neurons in the hidden layer

col_nn  <- list()  # stores the results of nnet() over variables - outer loop
row_nn  <- list()  # stores the results of nnet() over neurons - inner loop
col_mse <- list()
# row_mse <- list() # not needed because nn.mse is a data frame with rows
col_sum <- list()
row_sum <- list()
col_vars <- list()
row_vars <- list()
col_wts <- list()
row_wts <- list()

df_dim <- dim(train2)
df_dim[2]      # number of variables
df_dim[2] - 1
num_of_neurons <- 10

# build data frame to store results of neural net for each run
nn.mse <- data.frame(Train_MSE=seq(1:num_of_neurons),
                     Valid_MSE=seq(1:num_of_neurons))

# open log file and redirect output to log file
sink("D:\\XXX\\YYY\\ Programs\\Neural_Network_v8_VR_log.txt")

# outer loop - loop over variables
for (i in 3:df_dim[2]) {  # df_dim[2]

  # inner loop - loop over number of hidden neurons
  for (j in 1:num_of_neurons) {  # up to 10 neurons in the hidden layer

    # need to create a new data frame with just the predictor/input
    # variables needed
    train3 <- train2[, c(1:i)]

    coreaff.nn <- nnet(dep_var ~ ., train3, size = j, decay = 1e-3,
                       linout = T, skip = T, maxit = 1000, Hess = T)

    # row_vars[[j]] <- coreaff.nn$call  # not what we want
    # row_vars[[j]] <- names(train3)[c(2:i)]  # not needed in inner loop -
    # same number of variables for all neurons

    row_sum[[j]] <- summary(coreaff.nn)
    row_wts[[j]] <- coreaff.nn$wts

    rownames(nn.mse)[j] <- paste("H", j, sep="")
    nn.mse[j, "Train_MSE"] <- mean((train3$dep_var - predict(coreaff.nn))^2)
    nn.mse[j, "Valid_MSE"] <- mean((valid$dep_var - predict(coreaff.nn,
                                                            valid))^2)

  }

  col_vars[[i-2]] <- names(train3)[c(2:i)]
  col_sum[[i-2]]  <- row_sum
  col_wts[[i-2]]  <- row_wts
  col_mse[[i-2]]  <- nn.mse

}

# cbind(col_vars[1], col_vars[2])
col_vars
col_sum
col_wts
sink()

cbind(col_mse[[1]], col_mse[[2]], col_mse[[3]], col_mse[[4]], col_mse[[5]],
      col_mse[[6]], col_mse[[7]],

 

[R] mathematical model/equations for neural network in library(nnet)

2009-05-14 Thread jude.ryan
Hi All,

 

I am trying to manually extract the scoring equations for a neural
network so that I can score clients on a system that does not have R (a
mainframe using COBOL).

Using the example in Modern Applied Statistics with S (MASS), by
Venables and Ripley (VR), pages 246 and 247, I ran the following neural
network. The code is the same as in VR pages 246 and 247, except I have
skip = F. The equation will have 3 more terms if skip = T.

 

library(nnet)
attach(rock)
area1 <- area/10000; peri1 <- peri/10000
rock1 <- data.frame(perm, area = area1, peri = peri1, shape)
# skip = F
rock2.nn <- nnet(log(perm) ~ area + peri + shape, rock1, size=3,
                 decay=1e-3, linout=T, skip=F, maxit=1000, Hess=T)

 

# weights:  16

initial  value 1420.968942 

iter  10 value 96.823665

iter  20 value 32.177295

iter  30 value 25.012430

iter  40 value 23.109650

iter  50 value 20.981236

iter  60 value 15.019016

iter  70 value 14.082190

iter  80 value 14.042717

iter  90 value 13.931124

iter 100 value 13.883691

iter 110 value 13.877307

iter 120 value 13.875051

iter 130 value 13.873667

final  value 13.873634 

converged

 

summary(rock2.nn)

 

The output from summary(rock2.nn) is:

 

a 3-3-1 network with 16 weights

options were - linear output units  decay=0.001

 b->h1 i1->h1 i2->h1 i3->h1
 10.65  -8.90 -14.63   6.17
 b->h2 i1->h2 i2->h2 i3->h2
 -0.72  11.76 -17.17  -1.56
 b->h3 i1->h3 i2->h3 i3->h3
  2.96  -9.03  -8.07  -2.54
  b->o  h1->o  h2->o  h3->o
 -6.91   2.45  11.53   9.22

 

Following the mathematical model / equations shown in VR (pages 243 to
247) and another book on neural networks, I extracted the neural network
equations manually, and scored the dataset rock1, and compared the
manual scores I obtained with the scores from predict(). They were
totally different, and I am not sure what I am doing wrong. If anyone
can give me some pointers I would appreciate it.

 

The mathematical model/equations I come up with from the weights are:

 

# manual calculate neural network predictions based on neural network
equations

rock1$h1 <- 10.65 - 8.9 * rock1$area - 14.63 * rock1$peri + 6.17 * rock1$shape

rock1$logistic_h1 <- exp(rock1$h1) / (1 + exp(rock1$h1))

rock1$h2 <- -0.72 + 11.76 * rock1$area - 11.17 * rock1$peri - 1.56 * rock1$shape

rock1$logistic_h2 <- exp(rock1$h2) / (1 + exp(rock1$h2))

rock1$h3 <- 2.96 - 9.03 * rock1$area - 8.07 * rock1$peri - 2.54 * rock1$shape

rock1$logistic_h3 <- exp(rock1$h3) / (1 + exp(rock1$h3))

# predictions based on manual scoring

rock1$pred_perm <- -6.91 + 2.45 * rock1$logistic_h1 + 11.53 * rock1$logistic_h2 + 9.22 * rock1$logistic_h3

# predictions using predict() on the fitted neural network object

rock1$nn_pred_perm <- predict(rock2.nn)

rock1$log_perm <- log(rock1$perm)

head(rock1)

 

  perm   area peri shape h1 logistic_h1   h2
logistic_h2h3  logistic_h3 pred_perm nn_pred_perm log_perm

1  6.3 0.4990 0.279190 0.0903296  2.6816839   0.9359372 1.888774
0.8686156 -4.028470 0.0174901847  5.559444 1.920348 1.840550

2  6.3 0.7002 0.389260 0.1486220 -0.3596561   0.4110428 2.934467
0.9495242 -6.881634 0.0010254128  5.054524 1.546815 1.840550

3  6.3 0.7558 0.393066 0.1833120 -0.6961405   0.3326685 3.491694
0.9704505 -7.502529 0.0005513831  5.099416 2.630932 1.840550

4  6.3 0.7352 0.386932 0.1170630 -0.8318165   0.3032611 3.421303
0.9683637 -7.098737 0.0008254655  5.005834 2.489565 1.840550

5 17.1 0.7943 0.394854 0.1224170 -1.4406711   0.1914414 4.019478
0.9823546 -7.709940 0.0004481475  4.889712 3.235397 2.839078

6 17.1 0.7979 0.401015 0.1670450 -1.2874918   0.2162777 3.923376
0.9806092 -7.905522 0.0003685659  4.929703 3.078584 2.839078

 

sum((log(perm) - rock1$nn_pred_perm)^2)

[1] 12.55929

 

sum((log(perm) - rock1$pred_perm)^2)

[1] 82.63254
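
One way to avoid transcription slips entirely is to score straight from
the stored weight vector rather than retyping the rounded weights from
summary(). Below is a minimal sketch (my own, not code from this thread),
assuming the weight ordering that summary(rock2.nn) reports: each hidden
unit's bias and input weights first, then the output unit's bias and
hidden-to-output weights.

# minimal sketch: manual scoring from rock2.nn$wts (full precision)
wts <- rock2.nn$wts
sigmoid <- function(z) 1 / (1 + exp(-z))
X  <- as.matrix(cbind(1, rock1[, c("area", "peri", "shape")]))
W1 <- matrix(wts[1:12], nrow = 4)  # columns = hidden units h1..h3; rows = b, i1, i2, i3
w2 <- wts[13:16]                   # output unit: b, h1, h2, h3
H  <- sigmoid(X %*% W1)            # hidden-layer activations
manual_pred <- drop(cbind(1, H) %*% w2)
max(abs(manual_pred - predict(rock2.nn)))  # should be essentially 0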

 

Thanks in advance,

 

Jude

 

 

___
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.r...@ubs.com




[R] neural network not using all observations

2009-05-12 Thread jude.ryan
I am exploring neural networks (adding non-linearities) to see if I can
get more predictive power than a linear regression model I built. I am
using the function nnet and following the example of Venables and
Ripley, in Modern Applied Statistics with S, on pages 246 to 249. I have
standardized variables (z-scores) such as assets, age and tenure. I have
other variables that are binary (0 or 1). In max_acc_ownr_nwrth_n_med
for example, the variable has a value of 1 if the client's net worth is
above the median net worth and a value of 0 otherwise. These are derived
variables I created, together with variables that the regression
algorithm found to be predictive. A regression on the same variables
shown below gives me an R-Square of about 0.12. I am trying to increase
the predictive power of this regression model with a neural network,
being careful to avoid overfitting.

Similar to Venables and Ripley, I used the following code:

 

> library(nnet)

> dim(coreaff.trn.nn)

[1] 5088    8

> head(coreaff.trn.nn)

  hh.iast.y WC_Total_Assets all_assets_per_hh age  tenure
max_acc_ownr_liq_asts_n_med max_acc_ownr_nwrth_n_med
max_acc_ownr_ann_incm_n_med

1   3059448  -0.4692186-0.4173532 -0.06599001 -1.04747935
01   0

2   4899746   3.4854334 4.064 -0.06599001 -0.72540200
11   1

3727333  -0.2677357-0.4177944 -0.30136473 -0.40332465
11   1

4443138  -0.5295170-0.6999646 -0.1825 -1.04747935
00   0

5484253  -0.6112205-0.7306664  0.64013414  0.07979137
10   0

6799054   0.6580506 1.1763114  0.24784295  0.07979137
01   1

> coreaff.nn1 <- nnet(hh.iast.y ~ WC_Total_Assets + all_assets_per_hh +
age + tenure + max_acc_ownr_liq_asts_n_med +

+ max_acc_ownr_nwrth_n_med +
max_acc_ownr_ann_incm_n_med, coreaff.trn.nn, size = 2, decay = 1e-3,

+ linout = T, skip = T, maxit = 1000, Hess = T)

# weights:  26

initial  value 12893652845419998.00 

iter  10 value 6352515847944854.00

final  value 6287104424549762.00 

converged

> summary(coreaff.nn1)

a 7-2-1 network with 26 weights

options were - skip-layer connections  linear output units  decay=0.001

 b-h1 i1-h1 i2-h1 i3-h1 i4-h1 i5-h1
i6-h1 i7-h1 

 -21604.84   -2675.80   -5001.90   -1240.16-335.44  -12462.51
-13293.80   -9032.34 

 b-h2 i1-h2 i2-h2 i3-h2 i4-h2 i5-h2
i6-h2 i7-h2 

 210841.52   47296.92   58100.43  -13819.10   -9195.80  117088.99
131939.57  106994.47 

  b-o  h1-o  h2-o  i1-o  i2-o  i3-o
i4-o  i5-o  i6-o  i7-o 

1115190.67  894123.33 -417269.57   89621.84  170268.12   44833.63
59585.05  112405.30  437581.05  244201.69

> sum((hh.iast.y - predict(coreaff.nn1))^2)

Error: object "hh.iast.y" not found

 

So I try:

 

> sum((coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1))^2)

Error: dims [product 5053] do not match the length of object [5088]

In addition: Warning message:

In coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1) :

  longer object length is not a multiple of shorter object length

 

Doing a little debugging:

 

> pred <- predict(coreaff.nn1)

> dim(pred)

[1] 5053    1

> dim(coreaff.trn.nn)

[1] 5088    8

 

So the vector of predictions, pred, has 5,053 rows, while the input
dataset has 5,088 records.

 

It looks like the neural network is dropping 35 records. Does anyone
have any idea of why it would do this? It is most probably because those
35 records are bad data, a pretty common occurrence in the real world.
Does anyone know how I can identify the dropped records? If I can do
this I can get the dimensions of the input dataset to be 5,053 and then:

 

> sum((coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1))^2)

 

would work.
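
If the 35 dropped records are rows with missing values (the model frame
behind nnet() applies the na.action option, which defaults to na.omit),
then complete.cases() should identify them. A minimal sketch under that
assumption:

# sketch: find the rows nnet() silently dropped, assuming they contain NAs
ok <- complete.cases(coreaff.trn.nn)
sum(!ok)                               # expect 35 if the assumption holds
trn.used <- coreaff.trn.nn[ok, ]
sum((trn.used$hh.iast.y - predict(coreaff.nn1, trn.used))^2)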

 

A summary of my dataset is:

 

> summary(coreaff.trn.nn)

   hh.iast.yWC_Total_Assets  all_assets_per_hh age
tenure   max_acc_ownr_liq_asts_n_med

 Min.   :   0   Min.   :-6.970e-01   Min.   :-8.918e-01   Min.
:-4.617e+00   Min.   :-1.209e+00   Min.   :0. 

 1st Qu.:  565520   1st Qu.:-5.387e-01   1st Qu.:-6.147e-01   1st
Qu.:-4.583e-01   1st Qu.:-7.254e-01   1st Qu.:0. 

 Median :  834164   Median :-3.160e-01   Median :-3.718e-01   Median :
9.093e-02   Median :-2.423e-01   Median :0. 

 Mean   : 1060244   Mean   : 2.948e-13   Mean   : 3.204e-12   Mean
:-1.884e-11   Mean   :-3.302e-12   Mean   :0.4951 

 3rd Qu.: 1207181   3rd Qu.: 1.127e-01   3rd Qu.: 1.891e-01   3rd Qu.:
5.617e-01   3rd Qu.: 5.629e-01   3rd Qu.:1. 

 Max.   :45003160   Max.   : 

[R] FW: neural network not using all observations

2009-05-12 Thread jude.ryan
As a follow-up to my earlier email:

 

The input data frame to nnet() has dimensions:

 

> dim(coreaff.trn.nn)

[1] 5088    8

 

And the predictions from the neural network (35 records are dropped -
see my earlier email for details) have dimensions:

 

> pred <- predict(coreaff.nn1)

> dim(pred)

[1] 5053    1

 

So, the following line of R code does not work as the dimensions are
different.

 

> sum((coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1))^2)

Error: dims [product 5053] do not match the length of object [5088]

In addition: Warning message:

In coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1) :

  longer object length is not a multiple of shorter object length

 

While:

 

> dim(pred)

[1] 5053    1

 

> tail(pred)

  [,1]

5083  664551.9

5084  552170.6

5085  684834.3

5086 1215282.5

5087 1116302.2

5088  658112.1

 

shows that the last row of pred is 5,088, which corresponds to the
dimension of coreaff.trn.nn, the input data frame to the neural network.

 

I tried using row() to identify the 35 records that were dropped (or not
scored). The code I tried was:

 

> coreaff.trn.nn.subset <- coreaff.trn.nn[row(coreaff.trn.nn) ==
row(pred), ]

Error in row(coreaff.trn.nn) == row(pred) : non-conformable arrays

 

But I am not doing something right: row() requires an object with two
dimensions, and pred seems to behave as if it has only one. So, using
cbind(), I bound a column of sequence numbers to pred to force two
dimensions, but that did not help.

 

Basically, if I can identify the 5,053 records that the neural network
made predictions for, in the data frame of 5,088 records
(coreaff.trn.nn) used by the neural network, then I can compare the
predictions to the actual values, and compare the predictive power of
the neural network to the predictive power of the linear regression
model.

 

Any idea how I can extract the 5,053 records that the neural network
made predictions for from the data frame (5,088 records) used to train
the neural network? 
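
Since the tail(pred) output above shows that predict() keeps the row
names of the rows it actually scored, one way is to match on row names
rather than positions. A minimal sketch along those lines:

# sketch: keep only the training rows that predict() returned,
# matching on row names instead of row positions
pred <- predict(coreaff.nn1)
coreaff.trn.nn.used <- coreaff.trn.nn[rownames(pred), ]
sum((coreaff.trn.nn.used$hh.iast.y - pred)^2)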

 

Thanks in advance,

 

Jude

 




[R] How do I extract the scoring equations for neural networks and support vector machines?

2009-05-12 Thread jude.ryan
Sorry for these multiple postings.

I solved the problem using na.omit() to drop records with missing values
for the time being. I will worry about imputation, etc. later.

 

I calculated the sum of squared errors for 3 models, linear regression,
neural networks, and support vector machines. This is the first run.
Without doing any parameter tuning on the SVM or playing around with the
number of nodes in the hidden layer of the neural network, I found that
the SVM had the lowest sum of squared errors, followed by neural
networks, with regression being last. This probably indicates that the
data has non-linear patterns.

 

I have a couple of questions.

1) Besides the sum of squared errors, are there any other metrics that
can be used to compare these 3 models? AIC, BIC, etc., can be used for
regressions, but I am not sure whether they can be used for SVMs and
neural networks. (A rough version for the neural net is sketched after
this list.)

2) Is there any easy way to extract the scoring equations for SVMs and
neural networks? Using the R objects I can always score new data
manually but the model will need to be implemented in a production
environment. When the model gets implemented in production (could be the
mainframe) I will need equations that can be coded in any language
(COBOL or SAS on the mainframe). Also, getting the scoring equations for
all 3 models will let me create an ensemble model where the predicted
value could be the average of the predictions from the SVM, neural
network and linear regression. If the ensemble model has the smallest
sum of squared errors this would be the model I would use.
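
A minimal sketch of both ideas, using hypothetical placeholder objects
(lm.fit, nn.fit, svm.fit, and data frames train and valid are stand-ins,
not objects from this thread):

# rough least-squares AIC for the neural net, treating every weight as a
# free parameter (an overcount when decay > 0 shrinks the weights)
rss <- sum((train$y - predict(nn.fit, train))^2)
n   <- nrow(train)
aic <- n * log(rss / n) + 2 * length(nn.fit$wts)

# ensemble: average the three models' predictions and score the average
pred.ens <- (predict(lm.fit, valid) +
             as.vector(predict(nn.fit, valid)) +
             predict(svm.fit, valid)) / 3
sum((valid$y - pred.ens)^2)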

 

I have SAS Enterprise Miner as well and can get a scoring equation for
the neural network (I don't have SVM), but the scoring code that SAS EM
generates sucks and I would much rather extract a scoring equation from
R. I am using nnet() for the neural network.

 

Thanks in advance,

 

Jude Ryan

 




[R] reading version 9 SAS datasets in R

2008-12-03 Thread jude.ryan
Hi,

 

I am trying to read a SAS 9.1.3 dataset into R (to preserve the SAS
labels), but am unable to do so (I have already read in a CSV version).
I first created a transport file using the SAS code:

 

libname ces2 'D:\CES Analysis\Data';

filename transp 'D:\CES Analysis\Data\fadata.xpt';

 

/* create a transport file - R cannot read file created by proc cport */

proc cport data=ces2.fadata file=transp;

run;

 

I then tried to read it in R using:

 

> library(foreign)

> library(Hmisc)

> fadata2 <- sasxport.get("D:\\CES Analysis\\Data\\fadata.xpt")

Error in lookup.xport(file) : file not in SAS transfer format

 

Next I tried using the libname statement and the xport engine to create
a transport file. The problem with this method is that variable names
cannot be more than 8 characters as this method creates a SAS version 6
transport file. 

 

libname to_r xport 'D:\CES Analysis\Data\fadata2.xpt';

 

data to_r.fadata2;

  set ces2.fadata;

run;

 

But I get an error message in the SAS log:

 

493  libname to_r xport 'D:\CES Analysis\Data\fadata2.xpt';

NOTE: Libref TO_R was successfully assigned as follows:

  Engine:XPORT

  Physical Name: D:\CES Analysis\Data\fadata2.xpt

494

495  data to_r.fadata2;

496set ces2.fadata;

497  run;

 

ERROR: The variable name BUS_TEL_N is illegal for the version 6 file
TO_R.FADATA2.DATA.

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set TO_R.FADATA2 was only partially opened and will
not be saved.

 

Next I tried other ways of reading a SAS dataset in R, as shown below:

 

fadata2 <- sas.get("D:\\CES Analysis\\Data", mem="fadata")

Error in sas.get("D:\\CES Analysis\\Data", mem = "fadata") :

  Unix file, D:\CES Analysis\Data/c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .sd2 D:\CES Analysis\Data/c(NA, 64716,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 64716, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 

In addition: Warning message:

In sas.get("D:\\CES Analysis\\Data", mem = "fadata") :

  D:\CES Analysis\Data/formats.sc? or formats.sas7bcat  not found.
Formatting ignored. 

 

> ls()

[1] "fadata"

> ?read.xport

> fadata2 <- read.xport("D:\\CES Analysis\\Data\\fadata.xpt")

Error in lookup.xport(file) : file not in SAS transfer format

> ?read.ssd

> fadata2 <- read.ssd("D:\\CES Analysis\\Data", "fadata")

SAS failed.  SAS program at
D:\DOCUME~1\re06572\LOCALS~1\Temp\RtmpLqCVUx\file72ae2cd6.sas 

The log file will be file72ae2cd6.log in the current directory

Warning messages:

1: In system(paste(sascmd, tmpProg)) : sas not found

2: In read.ssd("D:\\CES Analysis\\Data", "fadata") :

  SAS return code was -1

> sashome <- "C:\\Program Files\\SAS\\SAS 9.1"

> fadata2 <- read.ssd(file.path(sashome, "core", "sashelp"), "fadata",
sascmd=file.path(sashome, "sas.exe"))

SAS failed.  SAS program at
D:\DOCUME~1\re06572\LOCALS~1\Temp\RtmpLqCVUx\file6df11649.sas 

The log file will be file6df11649.log in the current directory

Warning message:

In read.ssd(file.path(sashome, "core", "sashelp"), "fadata", sascmd =
file.path(sashome,  :

  SAS return code was 2

 

 

Is there any way I can read a SAS version 9 dataset into R so that I
can preserve the SAS labels?

If I have to shorten the SAS variable names to 8 characters or less to
create a SAS version 6 transport file, I could probably do without the
SAS labels, as I have already read the data into R from a CSV file.
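
One quick diagnostic, before worrying about names and labels, is to
check whether a given .xpt file is a genuine transport file at all:
foreign's lookup.xport() lists the members and variables of a real XPORT
file (and fails, as above, on PROC CPORT output, which is a different
format). A minimal sketch, assuming fadata2.xpt was written by the XPORT
libname engine:

> library(foreign)
> lookup.xport("D:\\CES Analysis\\Data\\fadata2.xpt")   # members and variables
> fadata2 <- read.xport("D:\\CES Analysis\\Data\\fadata2.xpt")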

 

Thanks in advance for any help.

 

Jude

 

___
Jude Ryan
Director, Client Analytic Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: [EMAIL PROTECTED]