That array would take about 700 MB of RAM. Do you have that much available?
By the way, you could probably work around this issue by using HashingVectorizer instead of CountVectorizer.
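A minimal sketch of that workaround, reusing the DATA_DIR / "a.txt" setup from your script (the n_features value below is just a placeholder, pick whatever fits your problem):

#######################################
import os
from sklearn.feature_extraction.text import HashingVectorizer

DATA_DIR = "."  # placeholder; point this at your actual data directory

# Same list of 2000 copies of the same filename as in your script.
data = [os.path.join(DATA_DIR, "a.txt") for i in range(2000)]

# HashingVectorizer is stateless: it hashes tokens into a fixed-size
# feature space instead of building an in-memory vocabulary, which may
# sidestep the allocation that fails for you in _count_vocab.
vectorizer = HashingVectorizer(encoding='utf-8-sig', input='filename',
                               n_features=2 ** 18)
vectors = vectorizer.transform(data)  # no fit needed, nothing is stored
#######################################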

On 08/04/2015 01:20 PM, Maria Gorinova wrote:
Hi Andy,

Thanks, I updated to 0.16.1, but the problem persists.
len(j_indices) is 68 356 000 when running with range(0, 2000), and exactly half of that with range(0, 1000).

Sebastian, thank you for the suggestion, but again, the issue doesn't seem to be that the process is using too much memory, so calling the garbage collector doesn't help.

Best,
Maria

On 4 August 2015 at 17:24, Andreas Mueller <t3k...@gmail.com> wrote:

    Thanks, Maria.
    What I was asking was whether you could use the debugger to see
    what len(j_indices) is when it crashes.
    I'm not sure if there have been improvements to this code since
    0.15.2, but I'd encourage you to upgrade to 0.16.1 anyhow.

    Cheers,
    Andy



    On 08/04/2015 11:56 AM, Maria Gorinova wrote:
    Hi Andreas,

    Thank you for the reply. The error also happens if I load
    different files, yes, but here I am deliberately loading the SAME
    file, "a.txt", just to demonstrate how strange the error is.
    I don't know what len(j_indices) is; it's in
    sklearn\feature_extraction\text.py, as shown in the exception
    trace. The version I'm using is 0.15.2 (I think).

    Best,
    Maria

    On 4 August 2015 at 16:30, Andreas Mueller <t3k...@gmail.com> wrote:

        Just to make sure, you are actually loading different files,
        not the same file over and over again, right?
        It seems an odd place for a memory error. Which version of
        scikit-learn are you using?
        What is ``len(j_indices)``?



        On 08/04/2015 10:18 AM, Maria Gorinova wrote:
        Hello,

        (I think I might have sent this to the wrong address the
        first time, so I'm sending it again)

        I have been trying to find my way around a weird memory
        error for days now. If I'm doing something wrong and this
        question is completely dumb, I'm sorry for spamming the
        mailing list. But I'm desperate.

        When running this code, everything works as expected:

        #######################################
        import os
        from sklearn.feature_extraction.text import CountVectorizer

        # DATA_DIR is defined elsewhere as the directory containing a.txt
        data = []
        for i in range(0, 1000):
            filename = "a.txt"
            data.append(os.path.join(DATA_DIR, filename))

        vectorizer = CountVectorizer(encoding='utf-8-sig', input='filename')
        vectors = vectorizer.fit_transform(data)
        #######################################

        However, if I change the range to (0, 2000), it gives me a
        MemoryError with the following trace:

        #######################################
        Traceback (most recent call last):
          File "C:\...\msin.py", line 16, in <module>
            vectors = vectorizer.fit_transform(data)
          File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
            self.fixed_vocabulary_)
          File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 769, in _count_vocab
            values = np.ones(len(j_indices))
          File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 178, in ones
            a = empty(shape, dtype, order)
        MemoryError
        #######################################

        Notes:
        - The file is about 200 000 characters / 40 000 words.
        - The OS is Windows 10.
        - The Python process takes about 340 MB of RAM at the moment
          of the MemoryError.
        - I've seen my Python processes take about 1.8 GB before
          without any problem, so Windows killing the process for
          using too much memory doesn't seem to be the case here.
        - I keep receiving the error even if I restrict the
          vocabulary size.

        Thanks in advance!!!
        Maria





        