Hi Andreas,

Thank you for the reply. The error also happens if I load different files,
yes, but in this example I am deliberately loading the SAME file, "a.txt",
over and over, just to show how odd the error is. I don't know what
len(j_indices) is; it comes from sklearn\feature_extraction\text.py, as the
exception trace shows. The version I'm using is 0.15.2 (I think).
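
If ``len(j_indices)`` counts every token occurrence, then a very rough
guess (and it is only a guess) would be 2000 files x ~40,000 words, i.e.
about 80 million entries; since np.ones allocates float64 by default,
that single array alone would need roughly 640 MB of contiguous memory.

To pin down the exact versions, I can run a quick check like the one
below (it assumes nothing beyond the standard ``__version__`` attributes):

#######################################
import sys
import sklearn
import numpy

# Print the interpreter, scikit-learn and NumPy versions; on Windows the
# interpreter string also says whether it is a 32-bit or 64-bit build.
print(sys.version)
print(sklearn.__version__)
print(numpy.__version__)
#######################################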

Best,
Maria

On 4 August 2015 at 16:30, Andreas Mueller <t3k...@gmail.com> wrote:

> Just to make sure, you are actually loading different files, not the same
> file over and over again, right?
> It seems an odd place for a memory error. Which version of scikit-learn
> are you using?
> What is ``len(j_indices)``?
>
>
>
> On 08/04/2015 10:18 AM, Maria Gorinova wrote:
>
> Hello,
>
> (I think I might have sent this to the wrong address the first time, so
> I'm sending it again)
>
> I have been trying to work my way around a strange memory error for days
> now. If I'm doing something wrong and this is a silly question, I'm sorry
> for spamming the mailing list, but I'm desperate.
>
> When running this code, everything works as expected:
>
> #######################################
> import os
> from sklearn.feature_extraction.text import CountVectorizer
>
> DATA_DIR = "."  # placeholder; in my script this points at the real data directory
>
> # Build a list of 1000 copies of the same file path.
> data = []
> for i in range(0, 1000):
>     filename = "a.txt"
>     data.append(os.path.join(DATA_DIR, filename))
>
> vectorizer = CountVectorizer(encoding='utf-8-sig', input='filename')
> vectors = vectorizer.fit_transform(data)
> #######################################
>
> However, if I change the range to (0, 2000), it gives me a MemoryError
> with the following trace:
>
> #######################################
> Traceback (most recent call last):
>   File "C:\...\msin.py", line 16, in <module>
>     vectors = vectorizer.fit_transform(data)
>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py",
> line 817, in fit_transform
>     self.fixed_vocabulary_)
>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py",
> line 769, in _count_vocab
>     values = np.ones(len(j_indices))
>   File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 178, in
> ones
>     a = empty(shape, dtype, order)
> MemoryError
> #######################################
>
> Notes:
> - The file is about 200,000 characters / 40,000 words.
> - The OS is Windows 10.
> - The Python process uses about 340 MB of RAM at the moment of the
> MemoryError.
> - I've seen my Python processes use about 1.8 GB before without any
> problem, so it doesn't look like Windows is killing the process for using
> too much memory.
> - I keep receiving the error even if I restrict the vocabulary size (see
> the sketch below for what I mean by that).
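>
> For reference, by "restrict the vocabulary size" I mean something like the
> sketch below; the ``max_features`` value is an arbitrary example, not the
> one I actually used:
>
> #######################################
> from sklearn.feature_extraction.text import CountVectorizer
>
> # `data` is the same list of file paths built above; max_features caps
> # the vocabulary at the N most frequent terms (10,000 is just an
> # arbitrary example value).
> vectorizer = CountVectorizer(encoding='utf-8-sig', input='filename',
>                              max_features=10000)
> vectors = vectorizer.fit_transform(data)
> #######################################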
>
> Thanks in advance!!!
> Maria
>
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
