On 4 August 2015 at 18:25, Ronnie Ghose <ronnie.gh...@gmail.com> wrote:
> are you able to make a np.ones stand alone of that size?

Yes, I can create an np.ones array of size 100,000,000, approximately.

On 4 August 2015 at 18:26, Andreas Mueller <t3k...@gmail.com> wrote:
> That array would take about 700MB of RAM. Do you have that much available?
> Btw, you could probably work around this issue by using HashingVectorizer
> instead of CountVectorizer.

Yes, I've got plenty of memory, even though Windows limits single processes
to 2GB. I've tried HashingVectorizer before and it gives me another numpy
memory error. Here is the trace:

###################################
  File "C:\..\main", line 26, in <module>
    vectors = vectorizer.fit_transform(data)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 472, in transform
    X = self._get_hasher().transform(analyzer(doc) for doc in X)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\hashing.py", line 129, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "_hashing.pyx", line 68, in sklearn.feature_extraction._hashing.transform (sklearn\feature_extraction\_hashing.c:1947)
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 1076, in resize
    a = concatenate( (a,)*n_copies)
MemoryError
###################################

Best,
Maria

> On 08/04/2015 01:20 PM, Maria Gorinova wrote:
>
> Hi Andy,
>
> Thanks, I updated to 0.16.1, but the problem persists.
> len(j_indices) is 68,356,000 when running for range(0, 2000) and exactly
> half of that when running for range(0, 1000).
>
> Sebastian, thank you for the suggestion, but again, the issue doesn't seem
> to be that the process is using too much memory, so calling the garbage
> collector doesn't help.
>
> Best,
> Maria
>
> On 4 August 2015 at 17:24, Andreas Mueller <t3k...@gmail.com> wrote:
>
>> Thanks Maria.
>> What I was asking was whether you could use the debugger to see what
>> len(j_indices) is when it crashes.
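[Editor's note: a minimal sketch of the HashingVectorizer workaround suggested above. The trace shows the failure happening inside a single large transform, so this sketch additionally splits the documents into batches and stacks the resulting sparse matrices, so no single call allocates one huge intermediate array. The document list, batch size, and n_features here are illustrative assumptions, not values from the thread.]

```python
# Sketch: hash-vectorize documents in batches and stack the sparse results.
from sklearn.feature_extraction.text import HashingVectorizer
import scipy.sparse as sp

docs = ["the quick brown fox jumps over the lazy dog"] * 2000  # stand-in data

# HashingVectorizer needs no vocabulary, so memory use is bounded by n_features.
vectorizer = HashingVectorizer(n_features=2**18)

batch_size = 500  # transform a few hundred documents at a time
chunks = [vectorizer.transform(docs[i:i + batch_size])
          for i in range(0, len(docs), batch_size)]
X = sp.vstack(chunks)  # one sparse matrix, built from modest-sized pieces

print(X.shape)  # (2000, 262144)
```

Each chunk is a scipy CSR matrix, so stacking them is cheap compared to the dense temporary arrays that trigger the MemoryError in the trace.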
>> I'm not sure if there have been improvements to this code since 0.15.2, but
>> I'd encourage you to upgrade to 0.16.1 anyhow.
>>
>> Cheers,
>> Andy
>>
>> On 08/04/2015 11:56 AM, Maria Gorinova wrote:
>>
>> Hi Andreas,
>>
>> Thank you for the reply. The error also happens if I load different
>> files, yes, but here I am actually loading the SAME file, "a.txt",
>> which I did just to demonstrate how awkward the error is. I don't know
>> what len(j_indices) is; that's in sklearn\feature_extraction\text.py, as
>> shown in the exception trace. The version I'm using is 0.15.2 (I think...).
>>
>> Best,
>> Maria
>>
>> On 4 August 2015 at 16:30, Andreas Mueller <t3k...@gmail.com> wrote:
>>
>>> Just to make sure: you are actually loading different files, not the
>>> same file over and over again, right?
>>> It seems an odd place for a memory error. Which version of scikit-learn
>>> are you using?
>>> What is ``len(j_indices)``?
>>>
>>> On 08/04/2015 10:18 AM, Maria Gorinova wrote:
>>>
>>> Hello,
>>>
>>> (I think I might have sent this to the wrong address the first time, so
>>> I'm sending it again.)
>>>
>>> I have been trying to find my way around a weird memory error for days
>>> now. If I'm doing something wrong and this question is completely dumb,
>>> I'm sorry for spamming the mailing list. But I'm desperate.
>>>
>>> When running this code, everything works as expected:
>>>
>>> #######################################
>>> import os
>>> from sklearn.feature_extraction.text import CountVectorizer
>>>
>>> data = []
>>> for i in range(0, 1000):
>>>     filename = "a.txt"
>>>     data.append(os.path.join(DATA_DIR, filename))
>>>
>>> vectorizer = CountVectorizer(encoding='utf-8-sig', input='filename')
>>> vectors = vectorizer.fit_transform(data)
>>> #######################################
>>>
>>> However, if I change the range to (0, 2000), it gives me a MemoryError
>>> with the following trace:
>>>
>>> #######################################
>>> Traceback (most recent call last):
>>>   File "C:\...\msin.py", line 16, in <module>
>>>     vectors = vectorizer.fit_transform(data)
>>>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
>>>     self.fixed_vocabulary_)
>>>   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 769, in _count_vocab
>>>     values = np.ones(len(j_indices))
>>>   File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 178, in ones
>>>     a = empty(shape, dtype, order)
>>> MemoryError
>>> #######################################
>>>
>>> Notes:
>>> - The file is about 200,000 characters / 40,000 words.
>>> - The OS is Windows 10.
>>> - The Python process takes about 340MB of RAM at the moment of the MemoryError.
>>> - I've seen my Python processes take about 1.8GB before and there was
>>>   never a problem, so Windows killing the process because it's trying to
>>>   use too much memory doesn't seem to be the case here.
>>> - I keep receiving the error even if I restrict the vocabulary size.
>>>
>>> Thanks in advance!!!
>>> Maria
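[Editor's note: a back-of-the-envelope check of the numbers reported in the thread. np.ones defaults to float64 (8 bytes per element) and must allocate one contiguous block, which is why a 32-bit Python process with a 2GB address space can fail here even though total memory use is only ~340MB: the address space may hold no contiguous gap that large.]

```python
# How big is the array that np.ones(len(j_indices)) tries to allocate?
n = 68356000                 # len(j_indices) reported for range(0, 2000)
bytes_needed = n * 8         # float64 = 8 bytes per element
mb = bytes_needed / 1024.0 / 1024.0
print(round(mb))             # roughly 522 MB, needed as one contiguous block
```

The same arithmetic matches Andreas's estimate above: 100,000,000 float64 elements is 800,000,000 bytes, on the order of 700-800MB.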
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general