I generally add the constant term to the feature vector itself if I want to use it. You are correct that it is usually critical for the model to work properly, but I prefer not to have a special case for it. The one place where I think that approach is wrong is where you want the intercept to get special treatment from the prior. It is common to put a very different prior on the intercept than on the coefficients. My only defense there is that common priors on the coefficients, like L1, allow plenty of latitude on the intercept, so as long as the data outweigh the prior this doesn't matter. There is a similar distinction between interactions and main effects.
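To make the "no special case" idea concrete, here is a minimal NumPy sketch (the data and function names are hypothetical, not from any particular Mahout code): the intercept is appended to each example as a constant-1 feature, so plain gradient descent on the logistic loss learns it exactly like any other coefficient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_intercept(X):
    # Append a constant-1 column so the intercept is just another coefficient.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_logistic(X, y, lr=0.1, steps=2000):
    # Plain gradient descent on the logistic log-loss; note there is no
    # special case anywhere for the intercept -- it rides along in the
    # last column of the augmented feature matrix.
    Xb = add_intercept(X)
    beta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ beta)
        grad = Xb.T @ (p - y) / len(y)
        beta -= lr * grad
    return beta

# Toy data (hypothetical): the positive class sits at larger x values.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = fit_logistic(X, y)
# beta[:-1] are the feature weights; beta[-1] is the learned intercept.
```

An L1 penalty would be applied uniformly to all entries of `beta` here, which is exactly the situation where the intercept gets no special treatment from the prior.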
One place it would matter a lot is multi-level inference, where you wind up with a pretty strong prior from the higher-level regressions (since that is where most of the data actually is). In that case, I would probably rather separate the handling. In fact, at that point, I think I would go with a grouped prior so that all of these cases can be handled in a coherent setting.

On the second question, betas can definitely go negative. That is how the model expresses an effect that decreases the likelihood of success.

On Sat, Sep 4, 2010 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
> There's something I don't understand about your derivation.
>
> I think Bishop generally suggests that in linear regression y = beta_0 +
> <beta, x> (so there's an intercept),
> and I think he uses a similar approach when fitting to the logistic function,
> where I think he suggests using P([mu + <beta, x>]/s),
> which of course can be thought of again as P(beta_0 + <beta, x>).
>
> But if there's no intercept beta_0, then y(x = (0, ..., 0)^T | beta) is always
> 0, which is not true of course in most situations. Does your method imply
> that having trivial input (all 0s) would produce a 0 estimate?
>
> Second question, are the betas allowed to go negative?
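A small NumPy sketch of both answers to the questions above (the coefficient values are made up for illustration): with the intercept carried as a constant-1 feature, the all-zeros input produces sigmoid(beta_0) rather than a degenerate value, and a negative beta pushes the predicted probability of success down.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: a negative weight on the one real feature,
# and a positive intercept in the last position (matching the
# constant-1-feature convention).
beta = np.array([-2.0, 1.0])

x_zero = np.array([0.0, 1.0])  # trivial input: all-zero features plus the bias 1
x_big = np.array([3.0, 1.0])   # a large feature value

p_zero = sigmoid(x_zero @ beta)  # equals sigmoid(beta_0), not 0
p_big = sigmoid(x_big @ beta)    # the negative beta drives this toward 0
```

So a trivial input does not force the estimate to 0 once the intercept is in play, and a negative coefficient is simply how the model says "more of this feature means less chance of success."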
