2012/1/3 Andreas <[email protected]>:
> Hi.
> I just switched to DecisionTreeClassifier to make analysis easier.
> There should be no joblib there, right?

Correct.

> One thing I noticed is that there is often
> ``np.argsort(X.T, axis=1).astype(np.int32)``
> which always does a copy.

This is done once at the beginning of build_tree and each time the
sample mask gets too sparse. In the latter case both `X` and `y` are
copied too. To isolate the memory overhead of ``np.argsort(X.T,
axis=1).astype(np.int32)`` at the beginning, we should test with
`min_density=1.0`, which turns off the fancy indexing of `X` and `y` and
the re-computation of `X_sorted`.
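
A rough sketch of that first copy (shapes are the MNIST-sized figures
from this thread; running it transiently needs about 1 GB):

    import numpy as np

    n_samples, n_features = 60000, 784          # MNIST-sized input (assumed)
    X = np.random.rand(n_samples, n_features)   # float64, ~376 MB

    # argsort returns a fresh integer buffer and astype(np.int32) copies
    # again, so this line always allocates new memory on top of X itself.
    X_sorted = np.argsort(X.T, axis=1).astype(np.int32)

    print("X:        %.0f MB" % (X.nbytes / 1e6))         # ~376 MB
    print("X_sorted: %.0f MB" % (X_sorted.nbytes / 1e6))  # ~188 MB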

> Still, as these should be garbage collected, I don't really see
> where all the memory goes...
> I'll give it a closer look later but I'll move to another box
> for now.
> Thanks everybody for the help!
> And sorry for keeping you @peter.
> Cheers,
> Andy
>
>
> On 01/03/2012 09:52 AM, Peter Prettenhofer wrote:
>> Hi,
>>
>> I just checked DecisionTreeClassifier - it basically requires the same
>> amount of memory for its internal data structures (= `X_sorted`, which
>> is also 60,000 x 784 x 4 bytes). I haven't checked RandomForest, but
>> you have to make sure that joblib does not fork a new process. If so,
>> the new process will have the same memory footprint as the parent
>> process (which is 2x the input size because X_sorted is precomputed).
>> Furthermore, because of Python's memory management I assume that the
>> data will be copied once more due to copy-on-write (actually, we don't
>> write X or X_sorted, but we increment their reference counts, which
>> should be enough to trigger a copy).
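>>
>> Back-of-envelope, with the shapes assumed above (the doubling is the
>> worst case, if the forked worker ends up touching everything):
>>
>>     n_samples, n_features = 60000, 784          # shapes assumed above
>>     mb = 1e6
>>     x_input  = n_samples * n_features * 8 / mb  # float64 X,      ~376 MB
>>     x_sorted = n_samples * n_features * 4 / mb  # int32 X_sorted, ~188 MB
>>     parent = x_input + x_sorted
>>     print("parent process:         ~%d MB" % parent)        # ~564 MB
>>     print("parent + forked worker: ~%d MB" % (2 * parent))  # ~1128 MB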
>>
>> best
>>
>> 2012/1/3 Gilles Louppe<[email protected]>:
>>
>>> Note also that when using bootstrap=True, copies of X have to be
>>> created for each tree.
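>>>
>>> Roughly what that looks like per tree (an illustration only, not the
>>> actual library code; `X` is the MNIST-sized array discussed above):
>>>
>>>     import numpy as np
>>>
>>>     rng = np.random.RandomState(0)
>>>     n = X.shape[0]
>>>     indices = rng.randint(0, n, n)     # draw n samples with replacement
>>>     X_boot = X[indices]                # fancy indexing -> a fresh copy
>>>     print("%.0f MB per tree" % (X_boot.nbytes / 1e6))   # ~376 MB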
>>>
>>> But this should work anyway since you only build 1 tree... Hmmm.
>>>
>>> Gilles
>>>
>>> On 3 January 2012 09:41, Peter Prettenhofer
>>> <[email protected]>  wrote:
>>>
>>>> Hi Andy,
>>>>
>>>> I'll investigate the issue with an artificial dataset of comparable
>>>> size - to be honest I suspect that we focused on speed at the cost of
>>>> memory usage...
>>>>
>>>> As a quick fix you could set `min_density=1`, which will result in fewer
>>>> memory copies at the cost of runtime.
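>>>>
>>>> i.e. something like this (parameter names as in the scikit-learn version
>>>> used in this thread; `min_density` was removed in later releases):
>>>>
>>>>     from sklearn.tree import DecisionTreeClassifier
>>>>
>>>>     # the suggested setting; X, y are the MNIST arrays discussed above
>>>>     clf = DecisionTreeClassifier(min_density=1)
>>>>     clf.fit(X, y)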
>>>>
>>>> best,
>>>>   Peter
>>>>
>>>> 2012/1/3 Andreas<[email protected]>:
>>>>
>>>>> Hi Gilles.
>>>>> Thanks! Will try that.
>>>>>
>>>>> Also thanks for working on the docs! :)
>>>>>
>>>>> Cheers,
>>>>> Andy
>>>>>
>>>>>
>>>>> On 01/03/2012 09:30 AM, Gilles Louppe wrote:
>>>>>
>>>>>> Hi Andreas,
>>>>>>
>>>>>> Try setting min_split=10 or higher. With a dataset of that size, there
>>>>>> is no point in using min_split=1: you will 1) indeed consume too much
>>>>>> memory and 2) overfit.
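>>>>>>
>>>>>> For instance (parameter names as in the scikit-learn version used in
>>>>>> this thread; `min_split` was renamed in later releases; shown on a
>>>>>> single tree for simplicity):
>>>>>>
>>>>>>     from sklearn.tree import DecisionTreeClassifier
>>>>>>
>>>>>>     # require at least 10 samples in a node before splitting it further
>>>>>>     clf = DecisionTreeClassifier(min_split=10)
>>>>>>     clf.fit(X, y)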
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> PS: I have just started to change the docs. Expect a PR later today :)
>>>>>>
>>>>>> On 3 January 2012 09:27, Andreas<[email protected]>    wrote:
>>>>>>
>>>>>>
>>>>>>> Hi Brian.
>>>>>>> The dataset itself is 60000 * 784 * 8 bytes (I converted from uint8 to
>>>>>>> float, which is 8 bytes in NumPy, I guess),
>>>>>>> which is ~360 MB (also, I can load it ;).
>>>>>>> I trained linear SVMs and Neural networks without much trouble. I
>>>>>>> haven't really studied the
>>>>>>> decision tree code (which I know you made quite an effort to optimize)
>>>>>>> so I don't really
>>>>>>> have an idea how the construction works. Maybe I just had a
>>>>>>> misconception of the memory
>>>>>>> usage of the algorithm. I just started playing with it.
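>>>>>>>
>>>>>>> (A quick check of the size estimate above, taking 28 * 28 = 784 features:)
>>>>>>>
>>>>>>>     n_samples, n_features = 60000, 28 * 28
>>>>>>>     print("uint8:   %.0f MB" % (n_samples * n_features / 1e6))      # ~47 MB
>>>>>>>     print("float64: %.0f MB" % (n_samples * n_features * 8 / 1e6))  # ~376 MB, i.e. ~359 MiB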
>>>>>>>
>>>>>>> Thanks for any comments :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Andy
>>>>>>>
>>>>>>>
>>>>>>> On 01/03/2012 09:06 AM, [email protected] wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi Andy,
>>>>>>>>
>>>>>>>> IIRC MNIST is 60000 samples, each with dimension 28x28, so hitting the
>>>>>>>> 2 GB limit doesn't seem unreasonable (especially since you don't have
>>>>>>>> all of that at your disposal). Does the dataset fit in memory?
>>>>>>>>
>>>>>>>> Brian
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Andreas<[email protected]>
>>>>>>>> Date: Tue, 03 Jan 2012 09:00:47
>>>>>>>> To:<[email protected]>
>>>>>>>> Reply-To: [email protected]
>>>>>>>> Subject: Re: [Scikit-learn-general] Question and comments on 
>>>>>>>> RandomForests
>>>>>>>>
>>>>>>>> One other question:
>>>>>>>> I tried to run a forest on MNIST that actually consisted of only one
>>>>>>>> tree.
>>>>>>>> That gave me a memory error. I only have 2 GB of RAM in this machine
>>>>>>>> (this is my desktop at IST Austria!?), which is obviously not that
>>>>>>>> much.
>>>>>>>> Still, this kind of surprised me. Is it expected that a tree takes
>>>>>>>> this "much" RAM? Should I change "min_density"?
>>>>>>>>
>>>>>>>> Thanks :)
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Peter Prettenhofer
>>>>
>>>>
>>>
>>
>>
>>
>
>



-- 
Peter Prettenhofer

