Hi Cory,

The lack of sample_weight support in sparse solvers is a known issue; see
https://github.com/scikit-learn/scikit-learn/issues/1190

In the meantime, I see two solutions. As described in the above issue, one
solution is to multiply each x_i and y_i in your training set by the square
root of its sample weight. For the squared loss this is exactly equivalent to
using sample weights, as long as no intercept is fit (fit_intercept=False),
and it lets you use fast sparse solvers like "sparse_cg" or "lsqr". The
second solution is to use SGDRegressor with its default squared loss, whose
fit method accepts sample_weight.
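
The first solution can be sketched as follows. This is only a minimal
illustration on a small random sparse matrix, not the real 70M x 24M problem:
the toy data, the variable names, and alpha=1.0 are all assumptions made for
the example. It rescales the rows by sqrt(w_i), fits Ridge with "sparse_cg"
and no intercept, and compares the result against a dense closed-form
weighted ridge solution.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Toy stand-in for the real data (hypothetical sizes and values).
n_samples, n_features = 200, 30
X = sp.random(n_samples, n_features, density=0.1, format="csr",
              random_state=rng)
y = X.dot(rng.randn(n_features)) + 0.1 * rng.randn(n_samples)
sample_weight = rng.randint(1, 10, size=n_samples).astype(float)

# Multiply each row x_i and each target y_i by sqrt(w_i).
# Left-multiplying by a sparse diagonal matrix keeps X sparse.
sqrt_w = np.sqrt(sample_weight)
X_scaled = sp.diags(sqrt_w, 0).dot(X)
y_scaled = y * sqrt_w

# Plain (unweighted) ridge on the rescaled data, without an intercept.
alpha = 1.0
model = Ridge(alpha=alpha, fit_intercept=False, solver="sparse_cg",
              tol=1e-10)
model.fit(X_scaled, y_scaled)

# Reference: weighted ridge in closed form, feasible only because the
# toy problem is small enough to densify.
X_dense = X.toarray()
A = X_dense.T.dot(X_dense * sample_weight[:, None]) + alpha * np.eye(n_features)
b = X_dense.T.dot(sample_weight * y)
reference_coef = np.linalg.solve(A, b)

max_abs_diff = np.max(np.abs(model.coef_ - reference_coef))
```

On the toy data the two coefficient vectors should agree up to solver
tolerance. On the full data set only the rescaling and the Ridge fit would
apply; the dense reference is there purely as a sanity check.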

HTH,
Mathieu


On Wed, Apr 2, 2014 at 9:18 AM, Cory Dolphin <[email protected]> wrote:

> Hello,
>
> I am trying to perform ridge regression on a relatively large data set:
> 70 million examples and 24 million very sparse features.
>
> Specifically, I have created an X matrix with dimensions
> (73725855, 24652292), an associated y vector with dimensions (73725855,),
> and a sample_weights vector of the same shape, (73725855,).
>
> In this case, the y vector is a rating, and the sample_weights describe
> how many times a given rating occurred.
>
> I need to use one of the sparse solvers, as the data set does not fit in
> memory as a dense matrix. However, it seems that none of the sparse solvers
> accept a sample_weights vector.
>
> Does anyone have experience with weighted ridge regression on large sparse
> matrices?
>
>
> I am new to the world of machine learning, so please forgive me for any
> vocabulary mistakes!
>
> Thanks,
> Cory
>
>
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>