Hm, I have never used Python on Windows, but I have heard from many people that 
it is way buggier than the POSIX equivalent; maybe it's just a quirk of the 
garbage collector?

Maybe you could try adding the following lines

import gc  # once, at the top of the script

gc.collect()
len(gc.get_objects())

inside your for-loop and give it another try? I know it looks weird to "clear" 
the garbage collector this way, but it worked for me when I also had memory 
issues running it on a Torque cluster.
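
In case it helps, here is a minimal sketch of what I mean. The names 
process_in_loop, items and work are made up for illustration and are not part 
of scikit-learn; only gc.collect() and gc.get_objects() come from the standard 
library. I can't promise this cures the MemoryError, it is just a cheap thing 
to try.

```python
import gc


def process_in_loop(items, work):
    """Apply `work` to each item and force a garbage-collection pass after each one.

    All names here are hypothetical and exist only for this sketch.
    """
    results = []
    for item in items:
        results.append(work(item))
        # Force a full collection now, so unreachable objects
        # (including reference cycles) are freed before the next iteration.
        unreachable = gc.collect()
        # Count the objects the collector currently tracks; steady growth
        # across iterations would hint at something holding on to memory.
        tracked = len(gc.get_objects())
        print("item=%r unreachable=%d tracked=%d" % (item, unreachable, tracked))
    return results


# Stand-in workload: len() replaces whatever per-file work the real loop does.
lengths = process_in_loop(["a.txt", "b.txt"], len)
```

If the tracked count keeps climbing from one iteration to the next, that would 
at least tell you where to look.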


> On Aug 4, 2015, at 11:56 AM, Maria Gorinova <m.gorin...@gmail.com> wrote:
> 
> Hi Andreas,
> 
> Thank you for the reply. The error also happens if I load different files, 
> yes, but here I am actually loading the SAME file, "a.txt". I did that just 
> to demonstrate how odd the error is. I don't know what len(j_indices) 
> is; it appears in sklearn\feature_extraction\text.py, as shown in the 
> exception trace. The version I'm using is 0.15.2 (I think...).
> 
> Best,
> Maria
> 
> On 4 August 2015 at 16:30, Andreas Mueller <t3k...@gmail.com> wrote:
> Just to make sure, you are actually loading different files, not the same 
> file over and over again, right?
> It seems an odd place for a memory error. Which version of scikit-learn are 
> you using?
> What is ``len(j_indices)``?
> 
> 
> 
> On 08/04/2015 10:18 AM, Maria Gorinova wrote:
>> Hello,
>> 
>> (I think I might have sent this to the wrong address the first time, so I'm 
>> sending it again)
>> 
>> I have been trying to find my way around a weird memory error for days now. 
>> If I'm doing something wrong and this question is completely dumb, I'm sorry 
>> for spamming the mailing list. But I'm desperate.
>> 
>> When running this code, everything works as expected:
>> 
>> #######################################
>> import os
>> from sklearn.feature_extraction.text import CountVectorizer
>> 
>> data = []
>> for i in range(0, 1000):
>>     filename = "a.txt"
>>     data.append(os.path.join(DATA_DIR, filename))
>> 
>> vectorizer = CountVectorizer(encoding = 'utf-8-sig', input = 'filename')
>> vectors = vectorizer.fit_transform(data)
>> #######################################
>> 
>> However, if I change the range to (0, 2000), it gives me a MemoryError with 
>> the following trace:
>> 
>> #######################################
>> Traceback (most recent call last):
>>   File "C:\...\msin.py", line 16, in <module>
>>     vectors = vectorizer.fit_transform(data)
>>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
>>     self.fixed_vocabulary_)
>>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 769, in _count_vocab
>>     values = np.ones(len(j_indices))
>>   File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 178, in ones
>>     a = empty(shape, dtype, order)
>> MemoryError
>> #######################################
>> 
>> Notes:
>> - the file is about 200 000 characters / 40 000 words.
>> - OS is Windows 10.
>> - the python process takes about 340MB RAM at the moment of Memory Error.
>> - I've seen my python processes taking about 1.8GB before and there was 
>> never a problem. So Windows killing the process because it's trying to use 
>> too much memory doesn't seem to be the case here. 
>> - I keep receiving the error even if I restrict the vocabulary size.
>> 
>> Thanks in advance!!!
>> Maria
>> 
>> ------------------------------------------------------------------------------
>> 
>> 
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
