Thank you, Ted, this is very instructive. There's something I don't understand about your derivation.
I think Bishop generally suggests that in linear regression y = beta_0 + <beta, x> (so there's an intercept), and I think he uses a similar approach when fitting the logistic function, where he suggests using P([mu + <beta, x>]/s), which of course can again be thought of as P(beta_0 + <beta, x>). But if there's no intercept beta_0, then y(x = (0,...,0)^T | beta) is always 0, which of course is not true in most situations. Does your method imply that a trivial input (all 0s) would produce a 0 estimate?

Second question: are the betas allowed to go negative?

Thank you, sir.

-Dmitriy

On Sat, Jul 10, 2010 at 10:36 AM, Ted Dunning <[email protected]> wrote:
> On Sat, Jul 10, 2010 at 5:26 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > But what if the variable being regressed has significant variance?
>
> For generalized regression in general, the regressor converges to the
> expectation, pretty much as you suggest:
>
>     E[y] = g^{-1}(\beta x)
>
> The link function g determines whether the regression is logistic
> regression or least squares or Poisson.
>
> > Say in such a heartbreaking example where people are coming into a
> > store, some of them end up buying something for some $$ but most don't
> > buy anything (sale $ = 0). Suppose I use SGD regression to regress the
> > sale $ using a bunch of individual sale regressors (such as the
> > person's profile, store theme/focus, etc.)
>
> This kind of mixed discrete/continuous problem is often best attacked by
> factoring it. First model p($ > 0) using logistic regression (or whatever
> binary regression technique is fashionable/effective). Then model the
> (nearly) continuous distribution p($ | $ > 0).
>
> The rationale here is that you often get a better result from this
> composite model than from a model that handles both steps at once.
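Ted's two-stage suggestion can be sketched in plain Python. This is only a minimal illustration, not Mahout's implementation; the toy data, learning rates, and all names (`w_buy`, `w_amt`, `expected_sale`, etc.) are invented for the example:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy store visits: x = (1.0, f), where the leading 1.0 is the intercept
# term and f is a single "profile" feature. Most visitors spend $0;
# buyers spend an amount loosely correlated with f.
visits = []
for _ in range(2000):
    f = random.uniform(0.0, 1.0)
    buys = random.random() < sigmoid(-1.0 + 2.0 * f)
    spend = 10.0 + 5.0 * f + random.gauss(0.0, 1.0) if buys else 0.0
    visits.append(((1.0, f), spend))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Stage 1: logistic regression for p($ > 0 | x), trained by SGD on log-loss.
w_buy = [0.0, 0.0]
for _ in range(5):
    for x, spend in visits:
        err = (1.0 if spend > 0 else 0.0) - sigmoid(dot(w_buy, x))
        w_buy = [wi + 0.1 * err * xi for wi, xi in zip(w_buy, x)]

# Stage 2: least-squares SGD for E[$ | $ > 0, x], fit on buyers only.
w_amt = [0.0, 0.0]
for _ in range(5):
    for x, spend in visits:
        if spend > 0:
            err = spend - dot(w_amt, x)
            w_amt = [wi + 0.01 * err * xi for wi, xi in zip(w_amt, x)]

def expected_sale(x):
    """Combine the two stages: E[$ | x] = p($ > 0 | x) * E[$ | $ > 0, x]."""
    return sigmoid(dot(w_buy, x)) * dot(w_amt, x)
```

A new visitor's expected sale is then `expected_sale((1.0, f))`, and a daily forecast is just the sum of that quantity over the day's visitors, which is exactly the use Dmitriy describes below.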
> For instance, in one case I have seen, p($ | $ > 0) was essentially
> trivial because there was very good knowledge about what the person was
> likely to buy based on which ad they clicked. Combining the models,
> however, increased the dimensionality enough to make the p($) model
> significantly harder to learn than p($ | $ > 0) or p($ > 0).
>
> > Obviously this regressand has a very high variance... But... If I can
> > hope to converge on the mathematical expectation of the sale, then I
> > would be able to predict, say, daily sales for individual stores based
> > on the number of people who visited per day -- or, for that matter, over
> > whatever interval, as long as we know how many people were there (which
> > basically makes my manager happy for the moment). Another thing is that
> > I want to try to come up with E(sale) for every new person coming into
> > a store, before he or she makes any deals, based on various regressors
> > such as person profile, store focus, etc.
>
> This sounds like you are 90% there.
>
> > So intuitively I feel that SGD must converge on E(regressand) in cases
> > where variance(regressand) is quite high, as SGD basically minimizes
> > RMSE (which is essentially the same as the variance). Is that correct?
> > But I am not quite sure if that is backed by the math of stochastic
> > gradient descent.
>
> Yes. For convex loss functions, SGD converges toward the MLE estimate.
> This isn't quite the same as minimum squared error, but your intuitions
> are going in the right direction.
>
> > Another question: would there be a difference between SGD+MLE vs.
> > SGD+least squares for high-variance regressands?
>
> Yes. There is a difference, but in practice it isn't that big a deal.
> Vowpal Wabbit uses RMSE as a loss function by default and simply limits
> the output value to the [0, 1] range. This works quite well. Mahout's SGD
> uses the MLE of logistic regression. That also works well.
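The intercept and sign questions from the top of the thread can be made concrete with a minimal SGD logistic-regression sketch (illustrative only, not Mahout's code; the toy model and learning rate are invented): with a beta_0 term, the prediction at x = (0,...,0) is sigmoid(beta_0) rather than 0, and nothing in the MLE gradient updates prevents any beta from going negative.

```python
import math
import random

random.seed(1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data from a true model p(y=1 | x) = sigmoid(1.5 - 3.0 * x):
# y is usually 1 near x = 0, and the feature pushes y toward 0, so the
# fitted intercept should come out positive and the slope negative.
data = []
for _ in range(5000):
    x = random.uniform(0.0, 1.0)
    y = 1 if random.random() < sigmoid(1.5 - 3.0 * x) else 0
    data.append((x, y))

beta_0, beta_1 = 0.0, 0.0  # nothing constrains either to stay non-negative
for _ in range(10):
    for x, y in data:
        p = sigmoid(beta_0 + beta_1 * x)
        beta_0 += 0.05 * (y - p)        # MLE gradient step for the intercept
        beta_1 += 0.05 * (y - p) * x    # ... and for the feature weight

# At the all-zeros input the model predicts sigmoid(beta_0), not 0,
# and beta_1 has been driven negative by the data.
print(beta_0, beta_1, sigmoid(beta_0))
```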
> I will be posting an updated patch today that does confidence-weighted
> learning, which considerably improves convergence time.
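Ted's earlier point that the link function g selects the model family can be shown with a tiny table of inverse links, all applied to the same linear score eta = beta . x (the dictionary and its names below are purely illustrative, not from any library):

```python
import math

# E[y] = g^{-1}(eta): the same linear score feeds a different inverse
# link per family.
inverse_link = {
    "least_squares": lambda eta: eta,                           # identity
    "logistic":      lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # E[y] in (0,1)
    "poisson":       lambda eta: math.exp(eta),                 # log link
}

eta = 0.0  # e.g. beta . x for some input
for name, g_inv in inverse_link.items():
    print(name, g_inv(eta))
# identity gives 0.0, logistic gives 0.5, poisson gives 1.0 at eta = 0
```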
