On 4 August 2015 at 18:25, Ronnie Ghose <ronnie.gh...@gmail.com> wrote:
> are you able to make a np.ones stand alone of that size?

Yes, I can create an np.ones array of size 100,000,000, approximately.

On 4 August 2015 at 18:26, Andreas Mueller <t3k...@gmail.com> wrote:
> That array would take about 700MB of RAM. Do you have that much available?
> Btw, you could probably work around this issue by using HashingVectorizer
> instead of CountVectorizer.

Yes, I've got plenty of memory, even though Windows limits single processes
to 2GB. I've tried HashingVectorizer before and it gives me another numpy
memory error. Here is the trace:

###################################
  File "C:\..\main", line 26, in <module>
    vectors = vectorizer.fit_transform(data)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 472, in transform
    X = self._get_hasher().transform(analyzer(doc) for doc in X)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\hashing.py", line 129, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "_hashing.pyx", line 68, in sklearn.feature_extraction._hashing.transform (sklearn\feature_extraction\_hashing.c:1947)
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 1076, in resize
    a = concatenate( (a,)*n_copies)
MemoryError
###################################

Best,
Maria

> On 08/04/2015 01:20 PM, Maria Gorinova wrote:
>
> Hi Andy,
>
> Thanks, I updated to 0.16.1, but the problem persists.
> len(j_indices) is 68,356,000 when running for range(0, 2000) and exactly
> half of that when running for range(0, 1000).
>
> Sebastian, thank you for the suggestion, but again, the issue doesn't seem
> to be that the process is using too much memory, so calling the garbage
> collector doesn't help.
>
> Best,
> Maria
>
> On 4 August 2015 at 17:24, Andreas Mueller <t3k...@gmail.com> wrote:
>
>> Thanks Maria.
>> What I was asking was whether you could use the debugger to see what
>> len(j_indices) is when it crashes.
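[Editor's note: a minimal sketch of the HashingVectorizer workaround suggested above. The trace shows the failure happening inside a single large transform, so this sketch additionally splits the documents into batches and stacks the resulting sparse matrices, so no single call allocates one huge intermediate array. The document list, batch size, and n_features here are illustrative assumptions, not values from the thread.]

```python
# Sketch: hash-vectorize documents in batches and stack the sparse results.
from sklearn.feature_extraction.text import HashingVectorizer
import scipy.sparse as sp

docs = ["the quick brown fox jumps over the lazy dog"] * 2000  # stand-in data

# HashingVectorizer needs no vocabulary, so memory use is bounded by n_features.
vectorizer = HashingVectorizer(n_features=2**18)

batch_size = 500  # transform a few hundred documents at a time
chunks = [vectorizer.transform(docs[i:i + batch_size])
          for i in range(0, len(docs), batch_size)]
X = sp.vstack(chunks)  # one sparse matrix, built from modest-sized pieces

print(X.shape)  # (2000, 262144)
```

Each chunk is a scipy CSR matrix, so stacking them is cheap compared to the dense temporary arrays that trigger the MemoryError in the trace.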
>> I'm not sure if there have been improvements to this code since 0.15.2, but
>> I'd encourage you to upgrade to 0.16.1 anyhow.
>>
>> Cheers,
>> Andy
>>
>> On 08/04/2015 11:56 AM, Maria Gorinova wrote:
>>
>> Hi Andreas,
>>
>> Thank you for the reply. The error also happens if I load different
>> files, yes, but here I am actually loading the SAME file, "a.txt",
>> which I did just to demonstrate how awkward the error is. I don't know
>> what len(j_indices) is; that's in sklearn\feature_extraction\text.py, as
>> shown in the exception trace. The version I'm using is 0.15.2 (I think...).
>>
>> Best,
>> Maria
>>
>> On 4 August 2015 at 16:30, Andreas Mueller <t3k...@gmail.com> wrote:
>>
>>> Just to make sure: you are actually loading different files, not the
>>> same file over and over again, right?
>>> It seems an odd place for a memory error. Which version of scikit-learn
>>> are you using?
>>> What is ``len(j_indices)``?
>>>
>>> On 08/04/2015 10:18 AM, Maria Gorinova wrote:
>>>
>>> Hello,
>>>
>>> (I think I might have sent this to the wrong address the first time, so
>>> I'm sending it again.)
>>>
>>> I have been trying to find my way around a weird memory error for days
>>> now. If I'm doing something wrong and this question is completely dumb,
>>> I'm sorry for spamming the mailing list. But I'm desperate.
>>>
>>> When running this code, everything works as expected:
>>>
>>> #######################################
>>> import os
>>> from sklearn.feature_extraction.text import CountVectorizer
>>>
>>> data = []
>>> for i in range(0, 1000):
>>>     filename = "a.txt"
>>>     data.append(os.path.join(DATA_DIR, filename))
>>>
>>> vectorizer = CountVectorizer(encoding='utf-8-sig', input='filename')
>>> vectors = vectorizer.fit_transform(data)
>>> #######################################
>>>
>>> However, if I change the range to (0, 2000), it gives me a MemoryError
>>> with the following trace:
>>>
>>> #######################################
>>> Traceback (most recent call last):
>>>   File "C:\...\msin.py", line 16, in <module>
>>>     vectors = vectorizer.fit_transform(data)
>>>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
>>>     self.fixed_vocabulary_)
>>>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 769, in _count_vocab
>>>     values = np.ones(len(j_indices))
>>>   File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 178, in ones
>>>     a = empty(shape, dtype, order)
>>> MemoryError
>>> #######################################
>>>
>>> Notes:
>>> - The file is about 200,000 characters / 40,000 words.
>>> - The OS is Windows 10.
>>> - The Python process takes about 340MB of RAM at the moment of the MemoryError.
>>> - I've seen my Python processes take about 1.8GB before and there was
>>>   never a problem, so Windows killing the process because it's trying to
>>>   use too much memory doesn't seem to be the case here.
>>> - I keep receiving the error even if I restrict the vocabulary size.
>>>
>>> Thanks in advance!!!
>>> Maria
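[Editor's note: a back-of-the-envelope check of the numbers reported in the thread. np.ones defaults to float64 (8 bytes per element) and must allocate one contiguous block, which is why a 32-bit Python process with a 2GB address space can fail here even though total memory use is only ~340MB: the address space may hold no contiguous gap that large.]

```python
# How big is the array that np.ones(len(j_indices)) tries to allocate?
n = 68356000                 # len(j_indices) reported for range(0, 2000)
bytes_needed = n * 8         # float64 = 8 bytes per element
mb = bytes_needed / 1024.0 / 1024.0
print(round(mb))             # roughly 522 MB, needed as one contiguous block
```

The same arithmetic matches Andreas's estimate above: 100,000,000 float64 elements is 800,000,000 bytes, on the order of 700-800MB.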
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general