Hi
 
I am working on corpora of automatically recognized utterances, looking
for features that predict error in the hypothesis the recognizer is
proposing.  
 
I am using the glm function to do logistic regression.  I do this type
of thing:
 
> logistic.model = glm(formula = similarity ~ ., family = binomial, data = data)
 
and end up with a model:
 
> summary(logistic.model)
 
Call:
glm(formula = similarity ~ ., family = binomial, data = data)
 
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1599   0.2334   0.3307   0.4486   1.2471  
 
Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)           11.1923783  4.6536898   2.405  0.01617 *  
length                -0.3529775  0.2416538  -1.461  0.14410    
meanPitch             -0.0203590  0.0064752  -3.144  0.00167 ** 
minimumPitch           0.0257213  0.0053092   4.845 1.27e-06 ***
maximumPitch          -0.0003454  0.0030008  -0.115  0.90838    
meanF1                 0.0137880  0.0047035   2.931  0.00337 ** 
meanF2                 0.0040238  0.0041684   0.965  0.33439    
meanF3                -0.0075497  0.0026751  -2.822  0.00477 ** 
meanF4                -0.0005362  0.0007443  -0.720  0.47123    
meanF5                -0.0001560  0.0003936  -0.396  0.69187    
ratioF2ToF1            0.2668678  2.8926149   0.092  0.92649    
ratioF3ToF1            1.7339087  1.7655757   0.982  0.32607    
jitter                -5.2571384 10.8043359  -0.487  0.62656    
shimmer               -2.3040826  3.0581950  -0.753  0.45120    
percentUnvoicedFrames  0.1959342  1.3041689   0.150  0.88058    
numberOfVoiceBreaks   -0.1022074  0.0823266  -1.241  0.21443    
percentOfVoiceBreaks  -0.0590097  1.2580202  -0.047  0.96259    
meanIntensity         -0.0765124  0.0612008  -1.250  0.21123    
minimumIntensity       0.1037980  0.0331899   3.127  0.00176 ** 
maximumIntensity      -0.0389995  0.0430368  -0.906  0.36484    
ratioIntensity        -2.0329346  1.2420286  -1.637  0.10168    
noSyllsIntensity       0.1157678  0.0947699   1.222  0.22187    
startSpeech            0.0155578  0.1343117   0.116  0.90778    
speakingRate          -0.2583315  0.1648337  -1.567  0.11706    
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
 
(Dispersion parameter for binomial family taken to be 1)
 
    Null deviance: 2462.3  on 4310  degrees of freedom
Residual deviance: 2209.5  on 4287  degrees of freedom
AIC: 2257.5
 
Number of Fisher Scoring iterations: 6
 
 
I have seen models where almost all the features are significant at the
0.001 level, but I accept that I could improve my model by normalizing
some of the features (some are left skewed, and I understand that I will
get a better fit by taking their logs, for example).
 
What really worries me is that the logistic function produces
predictions that appear to fall well outside 0 to 1.
 
If I make a dataset of the medians of the above features and apply my
logistic.model to it, it produces a figure of:
 
> x = predict(logistic.model, medians)
> x
[1] 2.82959
>
 
which is well outside the range of 0 to 1.
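
One thing worth checking here: for a glm object, predict() defaults to
type = "link", so the value returned is the linear predictor (the
log-odds), which is unbounded. Asking for type = "response" returns
probabilities instead. The sketch below uses simulated data, and the
names toy, x1, x2, y and fit are made up for illustration; they are not
the corpus or model from this post.

```r
# Simulated stand-in data (an assumption for illustration only).
set.seed(1)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(2 + 0.8 * toy$x1 - 0.5 * toy$x2))

fit <- glm(y ~ ., family = binomial, data = toy)

# Default scale: linear predictor (log-odds), not restricted to 0..1.
link <- predict(fit, toy)                      # same as type = "link"

# Response scale: fitted probabilities, strictly between 0 and 1.
prob <- predict(fit, toy, type = "response")

range(link)
range(prob)

# The two scales are related by the inverse logit, plogis().
all.equal(unname(prob), unname(plogis(link)))

# The median-case value quoted in the post, read as log-odds.
plogis(2.82959)
```

If the 2.82959 above is indeed on the log-odds scale, it corresponds to
a probability of about 0.94.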
 
The actual distribution of all the predictions is:
 
> summary(pred)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -1.516   2.121   2.720   2.731   3.341   6.387 
> 
 
I can get the model to give some sort of prediction by doing this:
 
> pred = predict(logistic.model, data)
> pred[pred <= 1.5] = 0 
> pred[pred > 1.5] = 1 
> t = table(pred, data[,24])
> t
    
pred 0    1   
   0  102  253
   1  255 3701
> 
> classAgreement(t)
$diag
[1] 0.8821619
 
$kappa
[1] 0.2222949
 
$rand
[1] 0.7920472
 
$crand
[1] 0.1913888
 
>
 
but as you can see I am using a break point well outside the range 0 to
1 and the kappa is rather low (I think).
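
A possible explanation for the odd-looking break point: predict() on a
glm object returns log-odds unless type = "response" is given, so a cut
at 1.5 on that scale is the same as a cut at plogis(1.5), about 0.82, on
the probability scale. The sketch below checks this on simulated data;
toy, x1, y, fit, cut.link and cut.prob are hypothetical names, not
objects from this post.

```r
# Simulated stand-in data (an assumption for illustration only).
set.seed(2)
n <- 500
toy <- data.frame(x1 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(2 + toy$x1))

fit <- glm(y ~ x1, family = binomial, data = toy)

link <- predict(fit, toy, type = "link")       # log-odds
prob <- predict(fit, toy, type = "response")   # probabilities

cut.link <- 1.5                # break point on the log-odds scale
cut.prob <- plogis(cut.link)   # the same break point as a probability

# Classify on each scale; the two labelings should agree.
class.link <- as.integer(link > cut.link)
class.prob <- as.integer(prob > cut.prob)
all(class.link == class.prob)

# Confusion table built from the probability-scale threshold.
table(predicted = class.prob, observed = toy$y)
```

Working on the probability scale makes it easier to compare the chosen
threshold with the usual 0.5 or with the base rate of the positive class.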
 
I am a bit of a novice in this, and the results worry me.  
 
Can anyone say whether the results look strange, or whether I am doing
something wrong?
 
Stephen
 
