Thanks for your suggestions, folks. I have made some progress.

My dataset PRODUCT contains the following variables:

Y - dependent variable (NUMBER_OF_MONTHS; it can be any positive integer)
SEGMENT - categorical independent variable (takes values 01, 02, ..., 60)
STATUS - censoring indicator: 1 = censored, 0 = uncensored
PURCHASE_DT - the date when the customer purchased the product
CANCEL_DT - the date when the customer canceled the product; missing if the customer has not canceled yet
CANCEL - equal to CANCEL_DT when it is present; otherwise set to today's date

*The censored obs are all RIGHT-CENSORED*
/* My SAS code */
DATA PRODUCT;
   SET PRODUCT;
   /* Not yet canceled: use today's date as the censoring date */
   IF CANCEL_DT EQ . THEN CANCEL = TODAY();
   ELSE CANCEL = CANCEL_DT;
   IF CANCEL_DT EQ . THEN STATUS = 1;
   ELSE STATUS = 0;
   /* STATUS=0 means CANCELLED, 1 means NOT CANCELLED */
   /* Number of month boundaries between purchase and cancel/censor date */
   Y = INTCK('MONTH', PURCHASE_DT, CANCEL);
   FORMAT CANCEL DATE9.;
   IF Y GE 3;   /* keep only obs with Y of at least 3 */
RUN;

PROC LIFEREG DATA=PRODUCT;
   CLASS SEGMENT;
   /* STATUS(1): a STATUS value of 1 marks a right-censored obs */
   MODEL Y*STATUS(1) = SEGMENT / DIST=WEIBULL;
   OUTPUT OUT=PROD_OUT P=PREDICTED_Y;
RUN;
QUIT;
/**********/

My questions are:

1. How do I decide which distribution to use? I tried exponential, Weibull, normal, etc.

2. About 75% of my obs are censored (there are 110,000 obs in total in my dataset PRODUCT).

3. The PREDICTED_Y values are really large, like 170, 200, etc., which is well above what I expected. I suspect this is due to the large number of right-censored obs in my dataset. I have heard that heavy censoring can lead to highly extrapolated predictions. Is there a way to handle such censoring problems? Also, is it really a problem, or is it OK to have this kind of situation?

4. If anybody knows of a better way of getting PREDICTED_Y, or a different way of doing this analysis, please let me know.

Thanks a lot,
AJ