Thank you, Ted, this is very instructive. There's something I don't understand about your derivation.
I think Bishop generally suggests that in linear regression y = beta_0 + <beta, x> (so there's an intercept), and I think he uses a similar approach when fitting the logistic function, where he suggests using P([mu + <beta, x>]/s), which of course can again be thought of as P(beta_0 + <beta, x>). But if there's no intercept beta_0, then y(x = (0,...,0)^T | beta) is always 0, which of course is not true in most situations. Does your method imply that a trivial input (all 0s) would produce a 0 estimate?

Second question: are the betas allowed to go negative?

Thank you, sir.

-Dmitriy

On Sat, Jul 10, 2010 at 10:36 AM, Ted Dunning <[email protected]> wrote:
> On Sat, Jul 10, 2010 at 5:26 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > But what if the variable being regressed has significant variance?
>
> For generalized regression in general, the regressor converges to the
> expectation, pretty much as you suggest:
>
>     E[y] = g^{-1}(\beta x)
>
> The link function g determines whether the regression is logistic
> regression or least squares or Poisson.
>
> > Say in such a heartbreaking example where people are coming into a
> > store, some of them end up buying something for some $$ but most don't
> > buy anything (sale $ = 0). Suppose I use SGD regression to regress the
> > sale $ using a bunch of individual sale regressors (such as the
> > person's profile, store theme/focus, etc.)
>
> This kind of mixed discrete/continuous problem is often best attacked by
> factoring it. First model p($ > 0) using logistic regression (or whatever
> binary regression technique is fashionable/effective). Then model the
> (nearly) continuous distribution p($ | $ > 0).
>
> The rationale here is that you often get a better result from this
> composite model than from a model that handles both steps at once.
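Ted's two-stage suggestion can be sketched in plain Python. This is only a minimal illustration, not Mahout's implementation; the toy data, learning rates, and all names (`w_buy`, `w_amt`, `expected_sale`, etc.) are invented for the example:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy store visits: x = (1.0, f), where the leading 1.0 is the intercept
# term and f is a single "profile" feature. Most visitors spend $0;
# buyers spend an amount loosely correlated with f.
visits = []
for _ in range(2000):
    f = random.uniform(0.0, 1.0)
    buys = random.random() < sigmoid(-1.0 + 2.0 * f)
    spend = 10.0 + 5.0 * f + random.gauss(0.0, 1.0) if buys else 0.0
    visits.append(((1.0, f), spend))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Stage 1: logistic regression for p($ > 0 | x), trained by SGD on log-loss.
w_buy = [0.0, 0.0]
for _ in range(5):
    for x, spend in visits:
        err = (1.0 if spend > 0 else 0.0) - sigmoid(dot(w_buy, x))
        w_buy = [wi + 0.1 * err * xi for wi, xi in zip(w_buy, x)]

# Stage 2: least-squares SGD for E[$ | $ > 0, x], fit on buyers only.
w_amt = [0.0, 0.0]
for _ in range(5):
    for x, spend in visits:
        if spend > 0:
            err = spend - dot(w_amt, x)
            w_amt = [wi + 0.01 * err * xi for wi, xi in zip(w_amt, x)]

def expected_sale(x):
    """Combine the two stages: E[$ | x] = p($ > 0 | x) * E[$ | $ > 0, x]."""
    return sigmoid(dot(w_buy, x)) * dot(w_amt, x)
```

A new visitor's expected sale is then `expected_sale((1.0, f))`, and a daily forecast is just the sum of that quantity over the day's visitors, which is exactly the use Dmitriy describes below.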
> For instance, in one case I have seen, p($ | $ > 0) was essentially
> trivial because there was very good knowledge about what the person was
> likely to buy based on which ad they clicked. Combining the models,
> however, increased the dimensionality enough to make the p($) model
> significantly harder to learn than p($ | $ > 0) or p($ > 0).
>
> > Obviously this regressand has a very high variance... But... If I can
> > hope to converge on the mathematical expectation of the sale, then I
> > would be able to predict, say, daily sales for individual stores based
> > on the number of people who visited per day -- or, for that matter, over
> > whatever interval, as long as we know how many people were there (which
> > basically makes my manager happy for the moment). Another thing is that
> > I want to try to come up with E(sale) for every new person coming into
> > a store, before he or she makes any deals, based on various regressors
> > such as person profile, store focus, etc.
>
> This sounds like you are 90% there.
>
> > So intuitively I feel that SGD must converge on E(regressand) in cases
> > where variance(regressand) is quite high, as SGD basically minimizes
> > RMSE (which is essentially the same as the variance). Is that correct?
> > But I am not quite sure if that is backed by the math of stochastic
> > gradient descent.
>
> Yes. For convex loss functions, SGD converges toward the MLE estimate.
> This isn't quite the same as minimum squared error, but your intuitions
> are going in the right direction.
>
> > Another question: would there be a difference between SGD+MLE vs.
> > SGD+least squares for high-variance regressands?
>
> Yes. There is a difference, but in practice it isn't that big a deal.
> Vowpal Wabbit uses RMSE as a loss function by default and simply limits
> the output value to the [0, 1] range. This works quite well. Mahout's SGD
> uses the MLE of logistic regression. That also works well.
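The intercept and sign questions from the top of the thread can be made concrete with a minimal SGD logistic-regression sketch (illustrative only, not Mahout's code; the toy model and learning rate are invented): with a beta_0 term, the prediction at x = (0,...,0) is sigmoid(beta_0) rather than 0, and nothing in the MLE gradient updates prevents any beta from going negative.

```python
import math
import random

random.seed(1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data from a true model p(y=1 | x) = sigmoid(1.5 - 3.0 * x):
# y is usually 1 near x = 0, and the feature pushes y toward 0, so the
# fitted intercept should come out positive and the slope negative.
data = []
for _ in range(5000):
    x = random.uniform(0.0, 1.0)
    y = 1 if random.random() < sigmoid(1.5 - 3.0 * x) else 0
    data.append((x, y))

beta_0, beta_1 = 0.0, 0.0  # nothing constrains either to stay non-negative
for _ in range(10):
    for x, y in data:
        p = sigmoid(beta_0 + beta_1 * x)
        beta_0 += 0.05 * (y - p)        # MLE gradient step for the intercept
        beta_1 += 0.05 * (y - p) * x    # ... and for the feature weight

# At the all-zeros input the model predicts sigmoid(beta_0), not 0,
# and beta_1 has been driven negative by the data.
print(beta_0, beta_1, sigmoid(beta_0))
```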
> I will be posting an updated patch today that does confidence-weighted
> learning, which considerably improves convergence time.
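Ted's earlier point that the link function g selects the model family can be shown with a tiny table of inverse links, all applied to the same linear score eta = beta . x (the dictionary and its names below are purely illustrative, not from any library):

```python
import math

# E[y] = g^{-1}(eta): the same linear score feeds a different inverse
# link per family.
inverse_link = {
    "least_squares": lambda eta: eta,                           # identity
    "logistic":      lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # E[y] in (0,1)
    "poisson":       lambda eta: math.exp(eta),                 # log link
}

eta = 0.0  # e.g. beta . x for some input
for name, g_inv in inverse_link.items():
    print(name, g_inv(eta))
# identity gives 0.0, logistic gives 0.5, poisson gives 1.0 at eta = 0
```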
