Hi Andreas,

Thank you for the reply. Yes, the error also happens if I load different files, but here I am actually loading the SAME file, "a.txt", over and over. I did that just to demonstrate how strange the error is.

I don't know what len(j_indices) is; it comes from sklearn\feature_extraction\text.py, as shown in the exception trace. The version I'm using is 0.15.2 (I think).
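For a rough sense of scale, here is a back-of-the-envelope sketch of how large the array in the failing line, np.ones(len(j_indices)), might be. It assumes that j_indices holds about one entry per word occurrence across all files (an assumption, not verified against the 0.15.2 source), and it uses the figure of 40 000 words per file from the notes in the original message. The function name estimate_array_bytes is made up for illustration.

```python
def estimate_array_bytes(n_files, words_per_file, bytes_per_entry=8):
    """Estimate the size of an array with one entry per word occurrence.

    bytes_per_entry defaults to 8 because np.ones() creates float64
    values unless a dtype is given.
    """
    # ASSUMPTION: one entry per word occurrence, summed over all files.
    n_entries = n_files * words_per_file
    return n_entries * bytes_per_entry


WORDS_PER_FILE = 40000  # "about 40 000 words" per the original message

for n_files in (1000, 2000):
    values_bytes = estimate_array_bytes(n_files, WORDS_PER_FILE)
    # ASSUMPTION: the index array itself uses 4-byte integers.
    index_bytes = estimate_array_bytes(n_files, WORDS_PER_FILE, bytes_per_entry=4)
    print("%d files: values ~%.0f MB, indices ~%.0f MB"
          % (n_files, values_bytes / 1e6, index_bytes / 1e6))
```

Under those assumptions, the failing allocation would be a single contiguous block of roughly 640 MB for 2000 files, compared with 320 MB for 1000. If the interpreter is a 32-bit build (the trace does not say), a block of that size can fail to allocate even when the process as a whole is well below its memory limit.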

Best,
Maria

On 4 August 2015 at 16:30, Andreas Mueller <t3k...@gmail.com> wrote:

> Just to make sure, you are actually loading different files, not the same
> file over and over again, right?
> It seems an odd place for a memory error. Which version of scikit-learn
> are you using?
> What is ``len(j_indices)``?
>
> On 08/04/2015 10:18 AM, Maria Gorinova wrote:
>
> Hello,
>
> (I think I might have sent this to the wrong address the first time, so
> I'm sending it again)
>
> I have been trying to find my way around a weird memory error for days
> now. If I'm doing something wrong and this question is completely dumb,
> I'm sorry for spamming the maillist. But I'm desperate.
>
> When running this code, everything works as expected:
>
> #######################################
> import os
> from sklearn.feature_extraction.text import CountVectorizer
>
> data = []
> for i in range(0, 1000):
>     filename = "a.txt"
>     data.append(os.path.join(DATA_DIR, filename))
>
> vectorizer = CountVectorizer(encoding = 'utf-8-sig', input = 'filename')
> vectors = vectorizer.fit_transform(data)
> #######################################
>
> However, if I change the range to (0, 2000) it gives me a Memory Error
> with the following trace:
>
> #######################################
> Traceback (most recent call last):
>   File "C:\...\msin.py", line 16, in <module>
>     vectors = vectorizer.fit_transform(data)
>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
>     self.fixed_vocabulary_)
>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 769, in _count_vocab
>     values = np.ones(len(j_indices))
>   File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 178, in ones
>     a = empty(shape, dtype, order)
> MemoryError
> #######################################
>
> Notes:
> - the file is about 200 000 characters / 40 000 words.
> - OS is Windows 10.
> - the python process takes about 340MB RAM at the moment of Memory Error.
> - I've seen my python processes taking about 1.8GB before and there was
> never a problem. So Windows killing the process because it's trying to use
> too much memory doesn't seem to be the case here.
> - I keep receiving the error even if I restrict the vocabulary size.
>
> Thanks in advance!!!
> Maria
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general