I generally add the constant term to the feature vector if I want to use
it.  You are correct that it is usually critical for correct operation, but I
prefer not to have a special case for it.  The one place where I think that
approach is wrong is where you want the prior to treat the intercept
specially.  It is common to have a very different prior on the intercept
than on the coefficients.  My only defense there is that common priors for
the coefficients, like L1, allow plenty of latitude on the intercept, so
as long as the data outweigh the prior, this doesn't matter.  There is a
similar distinction between interactions and main effects.
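To make the "constant term in the feature vector" idea concrete, here is a
rough Python sketch (mine, not from the thread; all function names are
hypothetical).  It appends a constant 1 to each feature vector so the
intercept is learned as an ordinary coefficient, and shows how an L1 penalty
can still skip the intercept slot if you want the prior to leave it alone:

```python
import math

def augment(x):
    # Append a constant 1 so the intercept is just another coefficient.
    return x + [1.0]

def predict(beta, x):
    # Logistic link: p = 1 / (1 + exp(-<beta, x>)); no special intercept term.
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(beta, x, y, rate=0.1, l1=0.01):
    # One stochastic gradient step with L1 shrinkage on every coefficient
    # except the last (intercept) slot, mirroring the idea that the prior
    # can leave the intercept effectively unpenalized.
    p = predict(beta, x)
    g = p - y
    new_beta = []
    for j, (b, xj) in enumerate(zip(beta, x)):
        b = b - rate * g * xj
        if j < len(beta) - 1:  # skip L1 shrinkage on the intercept
            b = max(0.0, abs(b) - rate * l1) * (1 if b >= 0 else -1)
        new_beta.append(b)
    return new_beta
```

With this layout the training loop never special-cases the intercept; only
the regularizer needs to know which slot it lives in.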

One place it would matter a lot is in multi-level inference, where you wind
up with a pretty strong prior from the higher-level regressions (since that
is where most of the data actually is).  In that case, I would probably
rather separate the handling.  In fact, at that point, I think I would
go with a grouped prior so that all of these cases can be handled in a
coherent setting.

On the second question, betas can definitely go negative.  That is how the
model expresses an effect that decreases the likelihood of success.
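A two-line illustration of that point (plain Python, not from the thread): a
negative beta pushes the logistic argument down, so the predicted probability
of success falls below 0.5 as the feature grows.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# With beta = -2 on a feature, increasing that feature lowers the
# predicted probability of success.
p_low = sigmoid(-2.0 * 0.0)   # feature = 0 gives p = 0.5
p_high = sigmoid(-2.0 * 1.0)  # feature = 1 gives p < 0.5
```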

On Sat, Sep 4, 2010 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:

> There's something I don't understand about your derivation.
>
>
>
> I think Bishop generally suggests that in linear regression y = beta_0 +
> <beta, x> (so there's an intercept), and I think he uses a similar
> approach when fitting the logistic function, where I think he suggests
> using P([mu + <beta, x>]/s), which of course can be thought of again as
> P(beta_0 + <beta, x>).
>
> But if there's no intercept beta_0, then y(x = (0,...,0)^T | beta) is
> always 0, which of course is not true in most situations. Does your
> method imply that a trivial input (all 0s) would produce a 0 estimate?
>
> Second question: are the betas allowed to go negative?
>
