I just suspect there must have been some research or study on how
accurate the solution of a factorization problem is when it is computed
on a subsample, similar to standard errors and confidence intervals.
E.g. I know how many samples I need for the observed mean to fall within
a given confidence interval, provided I know the original distribution.
So a similar estimate is sought for a factorization problem, assuming
some standard mixture model.
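
For reference, the sample-size calculation for a mean that I have in
mind is roughly the following. This is only a sketch under a normal
approximation with a known standard deviation; the function name and
parameters are made up for illustration, and nothing here carries over
to the factorization case as-is.

```python
import math
from statistics import NormalDist


def samples_for_mean_ci(sigma, half_width, confidence=0.95):
    """Smallest n such that a normal-theory confidence interval for the
    mean has half-width <= half_width, assuming a known std dev sigma."""
    if sigma <= 0 or half_width <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("sigma and half_width must be > 0, confidence in (0, 1)")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Half-width is z * sigma / sqrt(n); solve for n and round up.
    return math.ceil((z * sigma / half_width) ** 2)


if __name__ == "__main__":
    print(samples_for_mean_ci(sigma=1.0, half_width=0.1))
```

Halving the target half-width roughly quadruples the required sample
size, which is the kind of scaling law I would like to have for the
factorization optimum.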

-d

On Fri, Dec 16, 2011 at 10:56 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> The problem is convex, but the idea is not to use map reduce; it is to
> take a subsample and solve it in memory on the reduced sample (I was
> actually thinking of a simple bisection rather than trying to fit
> anything), but that's not the point.
>
> The point is how accurately the solution for a random subsample would
> reflect the actual optimum on the whole.
>
>
>
> On Fri, Dec 16, 2011 at 10:50 AM, Raphael Cendrillon
> <cendrillon1...@gmail.com> wrote:
>> Hi Dmitry,
>>
>> I have a feeling the objective may be very close to convex. In that case 
>> there are faster approaches than random subsampling.
>>
>> A common strategy for example is to fit a quadratic onto the previously 
>> evaluated lambda values, and then solve it for the minimum.
>>
>> This is an iterative approach, so wouldn't fit well within map reduce, but 
>> if you are thinking of doing this as a preprocessing step it would be OK.
>>
>> On Dec 16, 2011, at 10:05 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I vaguely remember the discussion about finding the optimum for the
>>> regularization rate (lambda) in ALS-WR.
>>>
>>> Would it make sense to take a subsample (or, rather, a random
>>> submatrix) of the original input and try to find the optimum for it
>>> somehow, similar to the total order partitioner's distribution sampling?
>>>
>>> I have put ALS with regularization and ALS-WR (and will put the
>>> implicit feedback paper as well) into R code, and I was wondering if it
>>> makes sense to find a better guess for lambda by just doing an R
>>> simulation on randomly subsampled data before putting it into the
>>> pipeline? Or is there a fundamental problem with this approach?
>>>
>>> Thanks.
>>> -Dmitriy
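
The quadratic-fit strategy suggested in the thread (fit a quadratic to
previously evaluated lambda values, then solve for its minimum) can be
sketched as below. This is a minimal illustration, not code from Mahout
or from the R scripts mentioned above: `error_fn` is a hypothetical
callback that would train on the subsample with a given lambda and
return a held-out error, and the loop assumes the error curve is close
to convex near the starting points.

```python
def parabola_vertex(p0, p1, p2):
    """x-coordinate of the vertex of the parabola through three (x, y) points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if den == 0:
        raise ValueError("points are collinear; no unique parabola vertex")
    return x1 - 0.5 * num / den


def refine_lambda(error_fn, lambdas, steps=5, tol=1e-9):
    """Successive parabolic interpolation from three starting lambdas.

    error_fn is a hypothetical stand-in for "fit on the subsample with
    this lambda and return the held-out error".
    """
    if len(lambdas) != 3:
        raise ValueError("exactly three starting lambdas are required")
    pts = [(lam, error_fn(lam)) for lam in lambdas]
    for _ in range(steps):
        try:
            lam = parabola_vertex(*pts)
        except ValueError:
            break  # degenerate fit, keep the best point seen so far
        if any(abs(lam - x) <= tol for x, _ in pts):
            break  # the fit no longer moves: converged
        pts.append((lam, error_fn(lam)))
        # Keep the three lowest-error points for the next fit.
        pts = sorted(pts, key=lambda p: p[1])[:3]
    return min(pts, key=lambda p: p[1])[0]


if __name__ == "__main__":
    # Toy error curve with its minimum at lambda = 0.3 (an assumption
    # made purely for demonstration).
    print(refine_lambda(lambda lam: (lam - 0.3) ** 2 + 1.0, [0.0, 0.5, 1.0]))
```

Each step costs one model fit, so on a small in-memory subsample a
handful of evaluations may be enough; whether the lambda found this way
transfers to the full data set is exactly the open question of the
thread.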
