I just suspect there must have been some research or study on how accurately a factorization problem solved on a subsample reflects the solution on the full data, similar to standard errors and confidence intervals. E.g., I know how many samples I need to fit the observed mean into a certain confidence interval, provided I know the original distribution. A similar estimate is sought for a factorization problem, assuming some standard mixture model.
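For the mean case mentioned above, here is a minimal sketch of the textbook sample-size calculation. It assumes a known standard deviation and the normal approximation; the function name and parameters are hypothetical and only illustrate the kind of estimate being asked about, not an answer for the factorization case.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma: float, margin: float, confidence: float = 0.95) -> int:
    """Smallest n such that the two-sided confidence interval for the mean
    has half-width <= margin, assuming known sigma and a normal approximation.
    """
    if sigma <= 0 or margin <= 0:
        raise ValueError("sigma and margin must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Two-sided critical value z_{1 - alpha/2}.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Half-width is z * sigma / sqrt(n); solve for n and round up.
    return math.ceil((z * sigma / margin) ** 2)


if __name__ == "__main__":
    print(required_sample_size(sigma=1.0, margin=0.1, confidence=0.95))
```

Halving the margin roughly quadruples the required n, which is the kind of scaling law one would hope to have for the subsampled factorization as well.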
-d

On Fri, Dec 16, 2011 at 10:56 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> the problem is convex but the idea is not to use a map reduce but a
> subsample and solve it in memory on a reduced sample (i was actually
> thinking of simple bisect rather than trying to fit to anything), but
> that's not the point.
>
> the point is how accurate the solution for a random subsample would
> reflect the actual optimum on the whole.
>
>
>
> On Fri, Dec 16, 2011 at 10:50 AM, Raphael Cendrillon
> <cendrillon1...@gmail.com> wrote:
>> Hi Dmitry,
>>
>> I have a feeling the objective may be very close to convex. In that case
>> there are faster approaches than random subsampling.
>>
>> A common strategy for example is to fit a quadratic onto the previously
>> evaluated lambda values, and then solve it for the minimum.
>>
>> This is an iterative approach, so wouldn't fit well within map reduce, but
>> if you are thinking of doing this as a preprocessing step it would be OK.
>>
>> On Dec 16, 2011, at 10:05 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I remember vaguely the discussion of finding the optimum for reg rate
>>> in ALS-WR stuff.
>>>
>>> Would it make sense to take a subsample (or, rather, a random
>>> submatrix) of the original input and try to find optimum for it
>>> somehow, similar to total order partitioner's distribution sampling?
>>>
>>> I have put ALS with regularization and ALS-WR (and will put the
>>> implicit feedback paper as well) into R code and i was wondering if it
>>> makes sense to find a better guess for lambda by just doing an R
>>> simulation on a randomly subsampled data before putting it into
>>> pipeline? or there's a fundamental problem with this approach?
>>>
>>> Thanks.
>>> -Dmitriy
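
The quadratic-fit strategy described in the quoted thread (fit a quadratic through previously evaluated lambda values, then solve for its minimum) can be sketched roughly as below. This is one step of standard parabolic interpolation through three points, not code from the thread; the function name is hypothetical, and the error values would in practice come from validating ALS-WR at each lambda.

```python
def parabola_vertex(a, fa, b, fb, c, fc):
    """Return the x-coordinate of the vertex of the parabola passing through
    (a, fa), (b, fb), (c, fc). If the objective is close to convex in lambda,
    this vertex is a candidate for the next lambda to evaluate.
    """
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    if den == 0:
        # Collinear points (or repeated x values): no unique vertex.
        raise ValueError("points do not define a parabola with a vertex")
    return b - 0.5 * num / den


if __name__ == "__main__":
    # Toy stand-in for validation error as a function of lambda
    # (an assumption for illustration only).
    def toy_error(lam):
        return (lam - 0.3) ** 2 + 1.0

    lams = (0.1, 0.2, 0.5)
    next_lam = parabola_vertex(
        lams[0], toy_error(lams[0]),
        lams[1], toy_error(lams[1]),
        lams[2], toy_error(lams[2]),
    )
    print(next_lam)
```

In an iterative search, the new lambda would replace the worst of the three points and the step would be repeated until the estimates stop moving. A real objective is only approximately quadratic, so the vertex should be checked for being a minimum and for lying in a sensible range.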