Hi-

This appears to be a bug in the PSPP regression routine when the data contain a large number of missing values!

I recently noticed some small discrepancies in simple bivariate regression results between IBM SPSS, STATA and PSPP. Until Prof. Shackman's email, I hadn't realized that the discrepancies only occur when there are many missing values. I was just confused...

Sadly, I also find problems when running linear regressions using PSPP on data with missing values. I wish I knew what was causing the problem.

So, using Dropbox, I wanted to make available some data which seems to illustrate the issue.

Using psppire.exe 0.7.9-gab8ce2 on Windows AND psppire 0.7.8 on LinuxMint LXDE, PSPP calculates descriptive statistics just like SPSS and STATA on the same dataset, but does not calculate identical b coefficients when running bivariate or multivariate regressions.

I created the following public opinion survey data files, each consisting of three variables from the 2004 Canadian Election Study <http://www.queensu.ca/cora/ces.html>, which I recoded, declaring certain values to be missing:
http://dl.dropbox.com/u/35198072/ces2004-regtest.sav has many observations with missing values.
http://dl.dropbox.com/u/35198072/ces2004-regtest2.sav has the same three variables, but I dropped all of the cases with missing values.
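
The logic of the two-file comparison is that Stata and SPSS both drop incomplete cases by default when fitting a regression (listwise deletion), so a regression on the file with missing values should reproduce the coefficients from the file where those cases were already removed. Here is a minimal Python sketch of that check. It uses made-up numbers rather than the CES data, `listwise` and `ols_bivariate` are hypothetical helper names, and nothing here describes what PSPP does internally.

```python
# Toy data: each case is (x, y); None marks a missing value.
# These numbers are invented for illustration, not taken from the CES files.
with_missing = [(1.0, 2.0), (2.0, None), (3.0, 6.5),
                (None, 1.0), (4.0, 8.0), (5.0, 9.5)]

# The same cases with every incomplete row already dropped,
# analogous to the second data file.
pre_dropped = [(1.0, 2.0), (3.0, 6.5), (4.0, 8.0), (5.0, 9.5)]


def listwise(rows):
    """Keep only the cases in which every variable is present."""
    return [r for r in rows if all(v is not None for v in r)]


def ols_bivariate(pairs):
    """Return (intercept, slope) for a least-squares fit of y on x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


complete = listwise(with_missing)
intercept, slope = ols_bivariate(complete)

# Under listwise deletion the two fits must agree exactly.
fits_agree = ols_bivariate(pre_dropped) == (intercept, slope)
```

On this toy data four complete cases remain and the slope is about 1.89 in both fits. If a package gives different coefficients for the two files, it is not using the same set of cases (or the same sums) for both.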

This is the syntax file used to run descriptive statistics and three regression analyses.
http://dl.dropbox.com/u/35198072/regression-tests.sps

PSPP generates these regression results and descriptive statistics with missing values:
http://dl.dropbox.com/u/35198072/regression-test-pspp1.html
PSPP generates these regression results and descriptive statistics using the data without any missing values:
http://dl.dropbox.com/u/35198072/regression-test-pspp2.html

Here is the STATA output on the same data (.log is a text file - email me if you have a problem opening it). The first three regressions should match the output in regression-test-pspp1.html. They are close, but not close enough... The bottom three regressions use the data with no missing values and these DO match PSPP's output (in regression-test-pspp2.html).
http://dl.dropbox.com/u/35198072/regression-test-stata.log

I also ran the data on SPSS and found results consistent with STATA. There did not seem to be any problems with Pearson's Chi-Square or Kendall's Tau-B when running a crosstab on the data with the missing values.

I am sorry I don't know what has gone wrong, so I am making available this data in hopes someone might figure out where there is a mistake. I caution other users running regression on PSPP.

Yours,
Renan

On 04-Mar-12 11:37 PM, Gene Shackman wrote:
Hi

I'm using the windows version, psppire.exe 0.7.8-g997322, that I downloaded from
http://www.gnu.org/software/pspp/get.html
I'm using windows vista, home version.

My question is about linear regression. If I use data that has no missing values, then PSPP regression seems to work fine. I compared the results with other packages and got the same results, see http://gsociology.icaap.org/methods/comparing_freestaprograms.html However, if I use data that does have missing values, I get results that are different from other programs. See the results from other programs here
http://gsociology.icaap.org/methods/comparing_freestaprograms_missing.html
That page also lists the data set I'm using, and attached below are the results I get from PSPP. (If you format this as Courier, it lines up correctly.)

So 2 questions
1. How does PSPP deal with missing values? By the way, I tried coding blanks as missing and also tried replacing all the missing values with -99999 and telling PSPP those were missing values, and got exactly the same results.

2. There don't appear to be any options on how regression is done, like forward, backward, forced, etc. I didn't see anything in the documentation about it either. Is it just doing straight forced regression? Will there be any options on how to do regression?
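
On question 1, the two codings described (blanks as missing, or -99999 declared as a missing value) should reduce to the same set of complete cases if incomplete cases are simply dropped, which would be consistent with getting identical results either way. Below is a minimal Python sketch with made-up values. `SENTINEL`, `to_missing` and `complete_cases` are hypothetical names, and this is only an illustration, not a description of how PSPP handles missing values.

```python
SENTINEL = -99999  # the code declared as missing in the second approach


def to_missing(rows, sentinel=SENTINEL):
    """Replace the sentinel code with None so both codings look alike."""
    return [tuple(None if v == sentinel else v for v in r) for r in rows]


def complete_cases(rows):
    """Keep only the cases with no missing value in any variable."""
    return [r for r in rows if None not in r]


# The same four invented cases under the two codings.
blank_coded = [(30.1, 4.0), (None, 2.5), (41.7, None), (38.2, 1.0)]
sentinel_coded = [(30.1, 4.0), (-99999, 2.5), (41.7, -99999), (38.2, 1.0)]

kept_blank = complete_cases(blank_coded)
kept_sentinel = complete_cases(to_missing(sentinel_coded))

# Both codings leave the same cases for the regression.
same_cases = kept_blank == kept_sentinel
```

Since both codings keep the same cases, identical output from the two approaches suggests the coding of the missing values is being recognized, and that any discrepancy arises later in the computation.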

Thanks very much.

REGRESSION
    /VARIABLES= c_arable climate North phone_kpop
    /DEPENDENT=     gini
    /STATISTICS=COEFF R ANOVA.

Model Summary
#====#========#=================#==========================#
#  R #R Square|Adjusted R Square|Std. Error of the Estimate#
##===#========#=================#==========================#
#|.60#     .36|              .35|                      8.65#
##===#========#=================#==========================#

ANOVA
#===========#==============#===#===========#=====#============#
#           #Sum of Squares| df|Mean Square|  F  |Significance#
##==========#==============#===#===========#=====#============#
#|Regression#       4548.35|  4|    1137.09|15.19|         .00#
#|Residual  #       7933.89|106|      74.85|     |            #
#|Total     #      12482.24|110|           |     |            #
##==========#==============#===#===========#=====#============#

Coefficients
#===========#=====#==========#====#=====#============#
#           #  B  |Std. Error|Beta|  t  |Significance#
##==========#=====#==========#====#=====#============#
#|(Constant)#47.95|      2.06| .00|23.22|         .00#
#| c_arable # -.12|       .05|-.20|-2.28|         .02#
#|  climate #-1.24|      1.04|-.11|-1.20|         .23#
#|   North  # -.14|       .03|-.43|-4.96|         .00#
#|phone_kpop#  .00|       .00|-.07| -.81|         .42#
##==========#=====#==========#====#=====#============#




Gene



Gene Shackman, Ph.D.
The Global Social Change Research Project
http://gsociology.icaap.org
Free Resources for Methods in Evaluation and Social Research
http://gsociology.icaap.org/methods
----------
Applied Sociologist
----------



_______________________________________________
Pspp-users mailing list
Pspp-users@gnu.org
https://lists.gnu.org/mailman/listinfo/pspp-users

--
Renan Levine
Department of Political Science
University of Toronto - Scarborough
renan.lev...@utoronto.ca
http://individual.utoronto.ca/renan
(416) 208-2651
