Have you used sparse arrays?
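If many of those 3,600 high-cardinality features are one-hot encoded, most entries are zero and a scipy.sparse matrix shrinks the data dramatically; SGDClassifier accepts CSR input directly. A rough sketch of what I mean (the sample count, density, and random labels below are placeholders, not your actual data, and the hyperparameters are just the ones from this thread's 2017-era scikit-learn snippet):

import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier

# Placeholder data: 10,000 samples x 3,600 features, ~1% non-zero,
# stored in CSR format instead of a dense ndarray.
X = sparse.random(10000, 3600, density=0.01, format='csr', random_state=0)
y = np.random.randint(0, 2, size=10000)   # placeholder labels

# Same hyperparameters as in the thread (scikit-learn 0.18/0.19-era API).
model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
                      l1_ratio=0.0, fit_intercept=False, n_iter=1,
                      shuffle=False, learning_rate='constant', eta0=1.0)

# SGDClassifier handles CSR matrices without densifying them,
# so memory stays proportional to the number of non-zeros.
model.fit(X, y)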
On Fri, Jun 2, 2017 at 7:39 PM, Stuart Reynolds <stu...@stuartreynolds.net> wrote:
> Hmmm... is it possible to place your original data into a memmap?
> (Perhaps that will free up the 8 GB, depending on SGDClassifier internals?)
>
> https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
> https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas
>
> - Stuart
>
> On Fri, Jun 2, 2017 at 10:30 AM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > I also think this is likely a memory-related issue. I just ran the following snippet in a Jupyter notebook:
> >
> > import numpy as np
> > from sklearn.linear_model import SGDClassifier
> >
> > model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
> >                       l1_ratio=0.0, fit_intercept=False, n_iter=1,
> >                       shuffle=False, learning_rate='constant', eta0=1.0)
> >
> > X = np.random.random((1000000, 1000))
> > y = np.zeros(1000000)
> > y[:1000] = 1
> >
> > model.fit(X, y)
> >
> > The dataset takes approx. 8 GB, but the model fitting consumes ~16 GB -- probably because a copy of the X array is made in the code. The notebook didn't crash, but I think on machines with less RAM this could be an issue. One workaround you could try is to fit the model iteratively using partial_fit, for example 1000 samples at a time or so:
> >
> > indices = np.arange(y.shape[0])
> > batch_size = 1000
> >
> > for start_idx in range(0, indices.shape[0] - batch_size + 1, batch_size):
> >     index_slice = indices[start_idx:start_idx + batch_size]
> >     model.partial_fit(X[index_slice], y[index_slice], classes=[0, 1])
> >
> > Best,
> > Sebastian
> >
> >> On Jun 2, 2017, at 6:50 AM, Iván Vallés Pérez <ivanvallespe...@gmail.com> wrote:
> >>
> >> Are you monitoring your RAM consumption? I would say that is the cause of the majority of kernel crashes.
> >> On Fri, Jun 2, 2017 at 12:45, Aymen J <a...@hotmail.fr> wrote:
> >> Hey Guys,
> >>
> >> So I'm trying to fit an SGD classifier on a dataset that has 900,000 samples and about 3,600 features (high cardinality).
> >>
> >> Here is my model:
> >>
> >> model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
> >>                       l1_ratio=0.0, fit_intercept=False, n_iter=1,
> >>                       shuffle=False, learning_rate='constant', eta0=1.0)
> >>
> >> When I run the model.fit function, the program runs for about 5 minutes, and then I receive the message "the kernel has died" from Jupyter.
> >>
> >> Any idea what may cause that? Is my training data too big (in terms of features)? Can I do anything (parameters) to finish training?
> >>
> >> Thanks in advance for your help!
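For completeness, Stuart's memmap idea would look roughly like the sketch below, combined with Sebastian's partial_fit loop so that only one batch is ever materialised in RAM. The file path 'X.dat', the float32 dtype, and the all-zero placeholder labels are assumptions for illustration, not anything from the thread:

import numpy as np
from sklearn.linear_model import SGDClassifier

n_samples, n_features = 900000, 3600

# One-time step: create a disk-backed array and fill it in chunks
# from the original data source ('X.dat' is a placeholder path;
# at float32 this writes a ~13 GB file).
X = np.memmap('X.dat', dtype='float32', mode='w+',
              shape=(n_samples, n_features))
# ... fill X in chunks here ...
X.flush()

# Later: reopen read-only and train batch by batch; NumPy pages the
# data in from disk instead of holding it all in memory.
X = np.memmap('X.dat', dtype='float32', mode='r',
              shape=(n_samples, n_features))
y = np.zeros(n_samples)          # placeholder labels

model = SGDClassifier(loss='log', penalty=None, alpha=0.0,
                      l1_ratio=0.0, fit_intercept=False, n_iter=1,
                      shuffle=False, learning_rate='constant', eta0=1.0)

batch_size = 1000
for start in range(0, n_samples, batch_size):
    sl = slice(start, start + batch_size)
    # Copy just this slice into RAM before handing it to scikit-learn.
    model.partial_fit(np.asarray(X[sl]), y[sl], classes=[0, 1])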
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn