Re: [R] [FORGED] Regression with factors ?

David Winsemius Wed, 13 Jul 2016 10:00:18 -0700

> On Jul 13, 2016, at 8:01 AM, David Winsemius <dwinsem...@comcast.net> wrote:
> 
> 
>> On Jul 13, 2016, at 6:48 AM, stn021 <stn...@gmail.com> wrote:
>> 
>> Hello,
>> 
>> so here a numerical example in R-code. Code is appended below.
>> 
>> The output should be
>> 1) the numerical values of the abilities of the persons
>> 2) the multiplyer
>> 
>> 
>> Please note that
>> 
>> 1) I have used non-linear optimization to solve this problem and got
>> the expected result, though not with R but other software.
>> 
>> 2) I have applied lm() to this problem, even before I posted the
>> question. I am well aware of the syntax of formulas. I my last posting
>> I wrote the formula "freehand" so I made the previously mentioned
>> errors. Sorry about that.
>> 
>> 
>> 
>> Unfortunately the formulas with I() as well as multiplying variables
>> before running R does not work here. I() does not apply to factors (R
>> tells me) and multiplying in advance also works only for continuous
>> variables, not for factors, because there is no known numerical value
>> to multiply.
>> 
>> The latter is actually what my question is about, along with the
>> question on how to get R to treat two columns as two instances of the
>> same factor.
>> 
>> 
>> Just to be sure I used R to check if the data really counts as a
>> factor according to R-terminology. It really is a factor, see code
>> below.
>> 
>> 
>> 
>> This is the code for generating the example-data:
>> 
>> # --------------------------------------------------------------- #
>> pnames    = c( "alice" , "bob" , "charlie" , "don" , "eve" , "freddy"
>> , "grace" , "henry" )
>> pcount    = length( pnames )
>> 
>> # abilities = runif( pcount )
>> abilities = (1:pcount) / 10
>> 
>> persons = data.frame( name = pnames , ability = abilities )
>> persons
>> 
>> # random subset of possible combinations and extra df
>> combinations = combn( nrow( persons ) , 2 ) ;
>> combinations = cbind( combinations,combinations,combinations,combinations )
>> combinations = combinations[ , runif(ncol(combinations))<0.5 ]
>> ccount = ncol( combinations )
>> 
>> observed_data = data.frame(
>> idx1 = combinations[1,]
>> , idx2 = combinations[2,]
>> , p1 = ( persons$name[    combinations[1,] ] )
>> , p2 = ( persons$name[    combinations[2,] ] )
>> )
>> 
>> abilities_data = data.frame(
>> a1 = persons$ability[ combinations[1,] ]
>> , a2 = persons$ability[ combinations[2,] ]
>> )
>> 
>> # y = result of cooperation of each pair
>> multiplyer = runif(1) + 1
>> offset     = 1
>> cat( "multiplyer = " , multiplyer , "\n" )
>> cat( "offset = " , offset , "\n" )
>> 
>> y0 = multiplyer * ( offset - ( abilities_data$a1 - abilities_data$a2 ) ^ 2 )
>> noise = .05 * rnorm( ccount )
>> 
>> # check variables are really factors :
>> str(  observed_data$p1 )
>> dput( observed_data$p1 )
>> 
>> observed_data = data.frame( y = round( y0+noise,3 ) , observed_data )
>> observed_data
>> 
>> # --------------------------------------------------------------- #
> 
> Is this what is intended?
> 
>> observed_data$p1ab <- persons$ability[ match(observed_data$p1, persons$name) 
>> ]
>> observed_data$p2ab <- persons$ability[ match(observed_data$p2, persons$name) 
>> ]
>> head(observed_data)
>      y idx1 idx2    p1      p2 p1ab p2ab
> 1 1.149    1    6 alice  freddy  0.1  0.6
> 2 1.006    1    7 alice   grace  0.1  0.7
> 3 1.529    2    3   bob charlie  0.2  0.3
> 4 1.404    2    5   bob     eve  0.2  0.5
> 5 1.205    2    6   bob  freddy  0.2  0.6
> 6 1.187    2    7   bob   grace  0.2  0.7
> 
> 
>> lm( y ~ I( (p1ab -p2ab)^2 ), data=observed_data)
> 
> Call:
> lm(formula = y ~ I((p1ab - p2ab)^2), data = observed_data)
> 
> Coefficients:
>       (Intercept)  I((p1ab - p2ab)^2)  
>             1.506              -1.435  
> 
>> separate_term <- lm( y ~ I( (p1ab -p2ab)^2 ), data=observed_data)
>> summary(separate_term)
> 
> Call:
> lm(formula = y ~ I((p1ab - p2ab)^2), data = observed_data)
> 
> Residuals:
>      Min        1Q    Median        3Q       Max 
> -0.116249 -0.030996  0.002633  0.032765  0.136282 
> 
> Coefficients:
>                   Estimate Std. Error t value Pr(>|t|)    
> (Intercept)         1.50589    0.01067  141.08   <2e-16 ***
> I((p1ab - p2ab)^2) -1.43527    0.05863  -24.48   <2e-16 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 
> Residual standard error: 0.05304 on 44 degrees of freedom
> Multiple R-squared:  0.9316,  Adjusted R-squared:   0.93 
> F-statistic: 599.2 on 1 and 44 DF,  p-value: < 2.2e-16
> 
> You could also have compared 2 models differing only with rest to the 
> includion of an interaction term that was the squared difference in abilities:
> 
>> full <- lm( y ~ p1ab + p2ab + I( (p1ab -p2ab)^2 ), data=observed_data)
>> reduced <- lm( y ~ p1ab + p2ab , data=observed_data)
>> anova(full,reduced)
> Analysis of Variance Table
> 
> Model 1: y ~ p1ab + p2ab + I((p1ab - p2ab)^2)
> Model 2: y ~ p1ab + p2ab
>  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
> 1     42 0.11823                                  
> 2     43 0.17315 -1  -0.05492 19.509 6.892e-05 ***


I also tested whether "nesting" the I( ) calls would succeed in giving you any 
results from what I thought was a rather odd  construction

>>>> y = f * ( o - ( p1 - p2 )^2 )

> separate_term <- lm( y ~ I( (p1ab -p2ab)^2 ), data=observed_data)
> separate_term

Call:
lm(formula = y ~ I((p1ab - p2ab)^2), data = observed_data)

Coefficients:
       (Intercept)  I((p1ab - p2ab)^2)  
             1.506              -1.435  

> separate_term2 <- lm( y ~ I( 1- I( (p1ab -p2ab)^2 )), data=observed_data)
> separate_term2

Call:
lm(formula = y ~ I(1 - I((p1ab - p2ab)^2)), data = observed_data)

Coefficients:
              (Intercept)  I(1 - I((p1ab - p2ab)^2))  
                  0.07062                    1.43527  


The sign of the "interaction term" just gets reversed. I don't see that any 
offset term got calculated for the "1" term. Offsets can be included but I 
think this particular form where you are asking for two different coefficients, 
they might not be separable in the linear model framework. 

-- 
david.


> 
> 
> -- 
> David
> 
>> 
>> 
>> 2016-07-11 19:16 GMT+02:00 Jeff Newmiller <jdnew...@dcn.davis.ca.us>:
>>> Your clarification is promising.  A reproducible example is always 
>>> preferred, though never a guarantee. I expect to be somewhat preoccupied 
>>> this week so responses may be rather delayed, but the less setup we have to 
>>> the more likely that someone on the list will tackle it.
>>> 
>>> Re an answer: If you can make the example simple enough that you can tell 
>>> us what the right numerical result will be, we will have a better chance of 
>>> understanding what you are after.  E.g. if you start with a solution and 
>>> use it to create sample input data with then you don't need to actually 
>>> solve it to illustrate what you are after. [1]
>>> 
>>> Note that I am not aware of any package dedicated to this type of problem, 
>>> so unless someone else responds otherwise then you will likely have to use 
>>> bootstrapping or your own statistical analysis (Bayesian?) of the result.
>>> 
>>> [1] 
>>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>>> --
>>> Sent from my phone. Please excuse my brevity.
>>> 
>>> On July 11, 2016 7:28:41 AM PDT, stn021 <stn...@gmail.com> wrote:
>>>> Hello,
>>>> 
>>>> thank you for the replies. Sorry about the html-email, I forgot.
>>>> Should be OK with this email.
>>>> 
>>>> 
>>>> Don't be fooled be the apparent simplicity of the problem. I have
>>>> tried to reduce it to only a single relatively simple question.
>>>> 
>>>> The idea here is to model cooperation of two persons. The model is
>>>> about one specific aspect of that cooperation, namely that two persons
>>>> with similar abilities may be able to produce better results that two
>>>> very different persons.
>>>> 
>>>> That is only one part of the model with other parts modeling for
>>>> example the fact that of course two persons with a higher degree of
>>>> ability will produce better results per se.
>>>> 
>>>> 
>>>> It is not classic regression with factors. That can be easily done by
>>>> something like lm( y ~ (p1-p2)^2 ).
>>>> 
>>>> This expands to lm( y ~ p1^2 - 2*p1*p2 + p2^2 ). This contains a
>>>> multiplicagtions and for lm() this implies interactions between the
>>>> factor-levels and produces one parameter for each combination of
>>>> factor-levels that occurs in the data. That is not what the question
>>>> is about.
>>>> 
>>>> Also p1 and p2 are different levels of the same factor, while for lm()
>>>> it would be two different factors with different levels.
>>>> 
>>>> 
>>>> As for the sensical part: this has a real world application therefore
>>>> it makes sense.
>>>> 
>>>> Also it is not so difficult to solve with non-linear optimization. I
>>>> was hoping to be able to use R for that purpose because then the
>>>> results could easily be checked with statistical tests.
>>>> 
>>>> So my question is not "how to solve" but "how to solve with R".
>>>> 
>>>> 
>>>> As for the excess degrees of freedom, in real observations there would
>>>> of course be added noise due to either random variations or factors
>>>> not included in the model. So to generate a more reality-conforming
>>>> example I could add some random normal-distributed noise to the
>>>> dependent variable y. I previously left that part out because to me it
>>>> did not seem relevant.
>>>> 
>>>> 
>>>> Would you like me to make a complete example dataset with more records
>>>> and noise ?
>>>> 
>>>> 
>>>> The answer I look for would be the numerical values of the
>>>> factor-levels and numerical values for the multiplier (f) and the
>>>> offset (o), with p1 and p2 given as names (here: persons) and y given
>>>> as some level of achievement they reach by cooperating.
>>>> 
>>>> y = f * ( o - ( p1 - p2 )^2 )
>>>> 
>>>> Is that what you meant by "answer" ?
>>>> 
>>>> 
>>>> THX
>>>> stefan
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 2016-07-10 2:27 GMT+02:00 Jeff Newmiller <jdnew...@dcn.davis.ca.us>:
>>>>> 
>>>>> I have seen less sensical questions.
>>>>> 
>>>>> It would be nice if the example were a bit more complete (as in it
>>>> should have excess degrees of freedom and an answer) and less like a
>>>> homework problem (which are off topic here). It would of course also be
>>>> helpful if the OP were to conform to the Posting Guide, particularly in
>>>> respect to using plain text email.
>>>>> 
>>>>> It looks like the kind of nonlinear optimization problem that
>>>> evolutionary algorithms are often applied to. It doesn't look (to me)
>>>> like a typical problem that factors get applied to in formulas though,
>>>> because multiple instances of the same factor variable are present.
>>>>> --
>>>>> Sent from my phone. Please excuse my brevity.
>>>>> 
>>>>> On July 9, 2016 4:59:30 PM PDT, Rolf Turner <r.tur...@auckland.ac.nz>
>>>> wrote:
>>>>>> On 09/07/16 20:52, stn021 wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> I would like to analyse a model like this:
>>>>>>> 
>>>>>>> y = 1 *  ( 1 - ( x1 - x2 )  ^ 2   )
>>>>>>> 
>>>>>>> x1 and x2 are not continuous variables but factors, so the
>>>>>> observation
>>>>>>> contain the level.
>>>>>>> Its numerical value is unknown and is to be estimated with the
>>>> model.
>>>>>>> 
>>>>>>> 
>>>>>>> The observations look like this:
>>>>>>> 
>>>>>>> y        x1     x2
>>>>>>> 0.96  Alice  Bob
>>>>>>> 0.84  Alice  Charlie
>>>>>>> 0.96  Bob   Charlie
>>>>>>> 0.64  Dave Alice
>>>>>>> etc.
>>>>>>> 
>>>>>>> Each person has a numerical value. Here for example Alice = 0.2
>>>> and
>>>>>> Bob =
>>>>>>> 0.4
>>>>>>> 
>>>>>>> Then y = 0.96 = 1* ( 1- ( 0.2-0.4 ) ^ 2 ) , see first observation.
>>>>>>> 
>>>>>>> How can this be done in R ?
>>>>>> 
>>>>>> 
>>>>>> This question makes about as little sense as it is possible to
>>>> imagine.
>>>>>> 
>>>>>> cheers,
>>>>>> 
>>>>>> Rolf Turner
>>>>> 
>>> 
>> 
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius
> Alameda, CA, USA
> 
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] [FORGED] Regression with factors ?

Reply via email to