2012/1/3 Andreas <[email protected]>:
> Hi.
> I just switched to DecisionTreeClassifier to make analysis easier.
> There should be no joblib there, right?

Correct.

> One thing I noticed is that there is often
> ``np.argsort(X.T, axis=1).astype(np.int32)``
> which always does a copy.

This is done once at the beginning of build_tree and each time the sample
mask gets too sparse. In the latter case both `X` and `y` are copied too.
To see the memory overhead of ``np.argsort(X.T, axis=1).astype(np.int32)``
at the beginning, we should test with `min_density=1.0`, which turns off
fancy indexing of `X` and `y` and the re-computation of `X_sorted`.
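As a tiny self-contained illustration of that copy (plain NumPy, toy shapes only;
not the actual tree code):

```python
import numpy as np

# Toy illustration: the .astype(np.int32) call always returns a fresh array,
# so for a moment the sorted-index structure exists twice (int64 + int32).
X = np.random.rand(5, 3)

idx64 = np.argsort(X.T, axis=1)        # intp (int64 on most 64-bit platforms)
idx32 = idx64.astype(np.int32)         # always a new allocation

print(idx64.dtype, idx32.dtype)        # int64 int32
print(np.shares_memory(idx64, idx32))  # False -> a second index array was made
```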
> Still, as these should be garbage collected, I don't really see
> where all the memory goes...
> I'll give it a closer look later but I'll move to another box
> for now.
> Thanks everybody for the help!
> And sorry for keeping you, @peter.
> Cheers,
> Andy
>
> On 01/03/2012 09:52 AM, Peter Prettenhofer wrote:
>> Hi,
>>
>> I just checked DecisionTreeClassifier - it basically requires the same
>> amount of memory for its internal data structures (= `X_sorted`, which
>> is also 60,000 x 784 x 4 bytes). I haven't checked RandomForest, but you
>> have to make sure that joblib does not fork a new process. If so, the
>> new process will have the same memory footprint as the parent process
>> (which is 2x the input size because X_sorted is precomputed).
>> Furthermore, because of Python's memory management I assume that the
>> data will be copied once more due to copy-on-write (actually, we don't
>> write X or X_sorted, but we increment their reference counts, which
>> should be enough to trigger a copy).
>>
>> best
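To make those numbers concrete, a minimal back-of-the-envelope sketch (it
assumes a float64 input `X` and an int32 `X_sorted` of the same shape, and
that a forked worker ends up duplicating those pages as described above):

```python
# Rough footprint estimate for the MNIST-sized case discussed in this thread.
# Assumes a float64 input X and an int32 X_sorted of the same shape.
n_samples, n_features = 60000, 784
MB = 1024.0 ** 2

x_mb = n_samples * n_features * 8 / MB         # input X (float64): ~359 MB
x_sorted_mb = n_samples * n_features * 4 / MB  # precomputed X_sorted (int32): ~179 MB

per_process = x_mb + x_sorted_mb
print("X:        %.0f MB" % x_mb)
print("X_sorted: %.0f MB" % x_sorted_mb)
print("one fitting process: ~%.0f MB" % per_process)

# If joblib forks a worker and reference-count updates touch the
# copy-on-write pages, roughly the same amount can be duplicated again:
print("parent + forked worker: up to ~%.0f MB" % (2 * per_process))
```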
>>
>> 2012/1/3 Gilles Louppe <[email protected]>:
>>
>>> Note also that when using bootstrap=True, copies of X have to be
>>> created for each tree.
>>>
>>> But this should work anyway since you only build 1 tree... Hmmm.
>>>
>>> Gilles
>>>
>>> On 3 January 2012 09:41, Peter Prettenhofer
>>> <[email protected]> wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> I'll investigate the issue with an artificial dataset of comparable
>>>> size - to be honest, I suspect that we focused on speed at the cost
>>>> of memory usage...
>>>>
>>>> As a quick fix you could set `min_density=1`, which will result in
>>>> fewer memory copies at the cost of runtime.
>>>>
>>>> best,
>>>> Peter
>>>>
>>>> 2012/1/3 Andreas <[email protected]>:
>>>>
>>>>> Hi Gilles.
>>>>> Thanks! Will try that.
>>>>>
>>>>> Also thanks for working on the docs! :)
>>>>>
>>>>> Cheers,
>>>>> Andy
>>>>>
>>>>> On 01/03/2012 09:30 AM, Gilles Louppe wrote:
>>>>>
>>>>>> Hi Andreas,
>>>>>>
>>>>>> Try setting min_split=10 or higher. With a dataset of that size,
>>>>>> there is no point in using min_split=1: you will 1) indeed consume
>>>>>> too much memory and 2) overfit.
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> PS: I have just started to change the doc. Expect a PR later today :)
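Putting the two suggestions quoted above together, here is a minimal sketch on
dummy data. Note that `min_split` and `min_density` are the constructor
arguments of the scikit-learn release discussed in this thread; later releases
renamed `min_split` to `min_samples_split` and eventually removed
`min_density`, so treat this as a sketch of that era's API rather than a
current recipe:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Dummy stand-in data, just to make the snippet self-contained.
X = np.random.rand(1000, 784)
y = np.random.randint(0, 10, size=1000)

# min_split=10    -> do not split nodes with fewer than 10 samples
#                    (Gilles's suggestion above)
# min_density=1.0 -> the value Peter suggests above to limit the
#                    fancy-indexing copies made while the tree is grown
clf = DecisionTreeClassifier(min_split=10, min_density=1.0)
clf.fit(X, y)
```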
>>>>>>
>>>>>> On 3 January 2012 09:27, Andreas <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Brian.
>>>>>>> The dataset itself is 60000 * 784 * 8 bytes (I converted from uint8
>>>>>>> to float, which is 8 bytes in NumPy, I guess), which is ~360 MB
>>>>>>> (also, I can load it ;).
>>>>>>> I trained linear SVMs and neural networks without much trouble. I
>>>>>>> haven't really studied the decision tree code (which I know you made
>>>>>>> quite an effort to optimize), so I don't really have an idea how the
>>>>>>> construction works. Maybe I just had a misconception of the memory
>>>>>>> usage of the algorithm. I just started playing with it.
>>>>>>>
>>>>>>> Thanks for any comments :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Andy
>>>>>>>
>>>>>>> On 01/03/2012 09:06 AM, [email protected] wrote:
>>>>>>>
>>>>>>>> Hi Andy,
>>>>>>>>
>>>>>>>> IIRC MNIST is 60000 samples, each with dimension 28x28, so the 2GB
>>>>>>>> limit doesn't seem unreasonable (especially since you don't have
>>>>>>>> all of that at your disposal). Does the dataset fit in mem?
>>>>>>>>
>>>>>>>> Brian
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Andreas <[email protected]>
>>>>>>>> Date: Tue, 03 Jan 2012 09:00:47
>>>>>>>> To: <[email protected]>
>>>>>>>> Reply-To: [email protected]
>>>>>>>> Subject: Re: [Scikit-learn-general] Question and comments on
>>>>>>>> RandomForests
>>>>>>>>
>>>>>>>> One other question:
>>>>>>>> I tried to run a forest on MNIST that actually consisted of only
>>>>>>>> one tree. That gave me a memory error. I only have 2 GB of RAM in
>>>>>>>> this machine (this is my desktop at IST Austria!?), which is
>>>>>>>> obviously not that much.
>>>>>>>> Still, this kind of surprised me. Is it expected that a tree takes
>>>>>>>> this "much" RAM? Should I change "min_density"?
>>>>>>>>
>>>>>>>> Thanks :)
>>>>>>>>
>>>>>>>> Andy

--
Peter Prettenhofer
