Thank you Olivier for these suggestions.

I'd gladly test them, but in the meantime I realized that the dataset
I was trying to use could never fit in the 72GB of memory of the
machine I'm using. So I scaled it down, and as expected the error no
longer occurs.

But I'd be curious to know if there is any mechanism I could use to
allow a Random Forest classifier to work with bigger datasets (bigger
than what simply fits in memory)?
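
For reference, the memmap suggestion quoted below could be sketched
roughly like this with plain NumPy. The array shape and the file name
X.npy are placeholders, and this only shows how to keep the input data
on disk; it says nothing about how much memory the forest training
itself then needs:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the real feature matrix (sizes are made up).
rng = np.random.RandomState(0)
X = rng.rand(1000, 20)

# tempfile picks its directory from TMPDIR/TEMP/TMP when they are set,
# so the dump lands on whatever partition those variables point to.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "X.npy")
np.save(path, X)

# Reopen the file as a read-only memory map: the data stays on disk and
# pages are only read when the corresponding rows are accessed.
X_mm = np.load(path, mmap_mode="r")

# np.array() makes an in-memory copy of just the requested rows.
chunk = np.array(X_mm[:100])
```

The memory-mapped array can be sliced like a regular NumPy array, so
code that works chunk by chunk does not need the whole dataset in RAM.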

Thanks!


On 22 September 2012 16:18, Olivier Grisel <olivier.gri...@ensta.org> wrote:
> 2012/9/22 Christian Jauvin <cjau...@gmail.com>:
>> Hi,
>>
>> I have been doing multiple experiments using a RandomForestClassifier
>> (trained with the parallel code option) recently, without encountering
>> any particular problem. However as soon as I began using a much bigger
>> dataset (with the exact same code), I got this threading error:
>>
>> Exception in thread Thread-2:
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
>>     self.run()
>>   File "/usr/lib/python2.7/threading.py", line 504, in run
>>     self.__target(*self.__args, **self.__kwargs)
>>   File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
>>     put(task)
>> SystemError: NULL result without error in PyObject_Call
>>
>> I can provide additional details of course, but first maybe there is
>> something in particular I should be aware of, about size or memory
>> limit of the underlying objects in question?
>>
>
> It can be a memory error as the current implementation is very bad at
> managing the memory.
>
> You can try to replace the joblib folder in the sklearn source tree by
> the "pickling-pool" branch of my repo:
>
> https://github.com/joblib/joblib/pull/44
>
> That should help a lot. You can further memmap your original dataset
> as explained in the following doc to get even better memory usage
> reduction:
>
> https://github.com/ogrisel/joblib/blob/pickling-pool/doc/parallel_numpy.rst
>
> You might also want to set the TMP environment variable to a folder on
> a big partition.
>
> I am very interested in any feedback while using this branch.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
