Re: [Scikit-learn-general] How you free up memory or handle it while fitting/cross-validating model in Scikitlearn?

Sebastian Raschka Thu, 18 Feb 2016 10:43:03 -0800

> @Your code: Is this the full code or some part of it is missing? I can see
> ... 
> after


Yes, part of it is missing -- I removed it for clarity. It's essentially 
just a bunch of nested for-loops (bad style anyway, but that was just a 
quick workaround). It basically iterates over different parameter sets 
to do the grid search "manually."

Btw, I just saw that scikit-learn 0.17.1 came out today, including an updated 
version of joblib. It may be worth trying to see whether it solves the problem.




>> On Feb 17, 2016, at 2:53 PM, muhammad waseem <[email protected]> 
>> wrote:
>> 
>> @Sebastian: I will add to the discussion; it looks like it is not very 
>> active :(
>> 
>> @Your code: Is this the full code, or is some part of it missing? I can see
>> ... 
>> after
>> parameterset2:
>> 
>> for p2 in 
>> which means there is something missing there, no? 
>> 
>> Thanks
>> 
>> On Wed, Feb 17, 2016 at 7:40 PM, Sebastian Raschka <[email protected]> 
>> wrote:
>> @Waseem Oh, wait, I just saw that we already have an open issue for that; 
>> please see: https://github.com/scikit-learn/scikit-learn/issues/3973 It 
>> would be great if you could add to the discussion there. Meanwhile, I will 
>> try to run my code again in the next few days to check whether this bug 
>> still persists.
>> 
>> 
>> > On Feb 15, 2016, at 4:25 PM, Sebastian Raschka <[email protected]> 
>> > wrote:
>> >
>> > Hm, unfortunately, that's what I thought -- sounds like a bug in 
>> > joblib? Does anyone have any ideas on how to track this down?
>> >
>> > @Waseem Can you also try n_jobs=2? Here, I'd expect one of the following:
>> > 1) If everything is working correctly with the multi-threading, it would 
>> > use about 2 times the 12%, plus a little extra.
>> > 2) If you see something like ~30%, I'd say that an unnecessary copy is 
>> > being made.
>> > 3) If you see something like more than 30%, there would be a memory leak 
>> > somewhere.
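
One way to put numbers on these three scenarios is to record the peak memory of the Python process after a run. The sketch below uses the standard-library `resource` module, which is available on Unix only; the helper name `peak_rss_mib` is made up for illustration. It measures the current process, so it covers threads running inside that process but not separate worker processes.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes, macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Call this after the cross-validation run and compare n_jobs=1 with n_jobs=2.
print("peak RSS: %.1f MiB" % peak_rss_mib())
```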
>> >
>> > I mentioned scenario 3, because I observed a very similar behavior once:
>> > (see https://github.com/scikit-learn/scikit-learn/issues/3973)
>> >
>> > "I made some weird observations that my GridSearches keep failing after a 
>> > couple of hours and I initially couldn't figure out why. I monitored the 
>> > memory usage then over time and saw that it started with a few 
>> > gigabytes (~6 GB) and kept increasing until it crashed the node when it 
>> > reached the max. 128 GB the hardware can take. I was experimenting with 
>> > random forests for classification of a large number of text documents. For 
>> > simplicity -- to figure out what's going on -- I went back to naive Bayes.
>> > ...
>> > After some experimentation, I finally found out that
>> >
>> > gc.collect()
>> > len(gc.get_objects()) # particularly this part!
>> >
>> > in the for loop solves the problem, and the memory usage stays constant 
>> > at 6.5 GB over the run time of ~10 hours."
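
The workaround quoted above can be sketched as a loop that forces a garbage collection after each iteration. The function names and the toy workload here are hypothetical; in the original setting each iteration would fit and score one parameter combination. Why the `len(gc.get_objects())` call made a difference is not explained in this thread, so the sketch simply records the count so that growth over time would be visible.

```python
import gc


def run_iterations(n_iterations, work):
    """Call work(i) repeatedly, forcing a garbage collection after each call."""
    object_counts = []
    for i in range(n_iterations):
        work(i)  # placeholder for fitting and scoring one parameter setting
        gc.collect()  # reclaim unreachable objects, including reference cycles
        object_counts.append(len(gc.get_objects()))  # objects tracked by the GC
    return object_counts


def make_cycle(i):
    """Hypothetical workload that creates a reference cycle on every call."""
    node = {"id": i}
    node["self"] = node


counts = run_iterations(5, make_cycle)
print(counts)
```

A count that keeps growing from one iteration to the next would suggest that objects are being kept alive somewhere.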
>> >
>> >
>> >> On Feb 15, 2016, at 9:37 AM, muhammad waseem <[email protected]> 
>> >> wrote:
>> >>
>> >> @Sebastian: I tried running cross_validation with n_jobs=1 and it did 
>> >> not use swap memory; even the RAM usage was quite low (maximum 12%). 
>> >> However, this will take longer to finish. Any idea what to try now?
>> >>
>> >> Thanks
>> >> Kindest Regards
>> >> Waseem
>> >>
>> >> On Fri, Feb 12, 2016 at 9:58 PM, Jacob Schreiber 
>> >> <[email protected]> wrote:
>> >> I don't think that the data is copied for tree based classifiers. It uses 
>> >> the threading backend, so each thread should be sharing memory.
>> >>
>> >> On Fri, Feb 12, 2016 at 12:32 PM, Sebastian Raschka 
>> >> <[email protected]> wrote:
>> >> I'd suggest trying n_jobs=1 and checking whether swap memory is used (you 
>> >> don't have to run it until completion). If this runs fine without swap, we 
>> >> can work further from there.
>> >>
>> >> Sent from my iPhone
>> >>
>> >> On Feb 12, 2016, at 2:57 PM, muhammad waseem <[email protected]> 
>> >> wrote:
>> >>
>> >>> @Sebastian: I tried with n_jobs=10 (the total is 12) and it still 
>> >>> caused the same problem. I could try running it with n_jobs=1, but 
>> >>> it would be so slow that it would take ages to complete. The machine has 
>> >>> 32GB RAM and it started using swap memory after consuming the full RAM.
>> >>>
>> >>> Is there a way to tackle this, or do you really think that all this 
>> >>> k-fold cross-validation and training should be done using Spark's MLlib?
>> >>>
>> >>> Thanks
>> >>> Regards
>> >>> Waseem
>> >>>
>> >>>
>> >>> On Fri, Feb 12, 2016 at 6:40 PM, Sebastian Raschka 
>> >>> <[email protected]> wrote:
>> >>> Thanks for the note, Manoj, didn't know that!
>> >>>
>> >>> @muhammad So if there's no duplication of data across all processes, I 
>> >>> guess that you would also run into trouble with n_jobs=1. But just 
>> >>> to make sure that data duplication is not an issue, could you try 
>> >>> running it with n_jobs=1? If that fails as well, probably only a smaller 
>> >>> data set or a machine with more memory would help. Here, I'd probably 
>> >>> think about using Spark's MLlib to deal with this particular dataset.
>> >>>
>> >>>> On Feb 12, 2016, at 12:30 PM, muhammad waseem 
>> >>>> <[email protected]> wrote:
>> >>>>
>> >>>> Hi Sebastian and Manoj,
>> >>>> @Manoj: What should the value of the max_nbytes parameter be, and will 
>> >>>> this affect the results and the time it takes to run cross_validation, 
>> >>>> grid_search, etc.?
>> >>>> @Sebastian: Will the Spark implementation also improve the memory use, 
>> >>>> or just the CPU use?
>> >>>>
>> >>>>
>> >>>> Thanks
>> >>>> Kindest Regards
>> >>>>
>> >>>> On Fri, Feb 12, 2016 at 5:29 PM, muhammad waseem 
>> >>>> <[email protected]> wrote:
>> >>>> Hi Sebastian and Manoj,
>> >>>> @Manoj: What should the value of the max_nbytes parameter be, and will 
>> >>>> this affect the results and the time it takes to run cross_validation, 
>> >>>> grid_search, etc.?
>> >>>>
>> >>>> Thanks
>> >>>> Kindest Regards
>> >>>> Waseem
>> 
> 


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
