Hi Lorenzo,

please make sure to pickle with the highest protocol -- otherwise pickle
defaults to protocol 0, a textual serialization format which is quite
inefficient for numpy arrays:

  pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)
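A minimal round-trip sketch (toy data and the filename are placeholders; note the binary file modes, which the binary protocols require):

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the real training set.
X, y = make_regression(n_samples=100, n_features=5, random_state=0)
clf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Binary protocols need binary file mode on both ends ("wb" / "rb").
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("model.pkl", "rb") as f:
    clf_loaded = pickle.load(f)
```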

For large datasets limit the number of tree nodes by specifying
``min_samples_leaf`` -- setting this to 5 can give you a 5-fold
memory/disk-space reduction w/o much loss in accuracy (pls benchmark).
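For example (a sketch on synthetic data -- the exact reduction depends entirely on your dataset, hence the benchmark suggestion):

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)

default = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
pruned = RandomForestRegressor(n_estimators=20, min_samples_leaf=5,
                               random_state=0).fit(X, y)

def pickled_size(model):
    return len(pickle.dumps(model, pickle.HIGHEST_PROTOCOL))

# Larger leaves -> fewer tree nodes stored -> smaller pickle.
print(pickled_size(default), pickled_size(pruned))
```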

Also, I'd suggest you use joblib.dump instead of pickle because it is
faster for numpy arrays and has an option to compress the output (beware
of high memory consumption during compression, though).
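For example (a sketch; filename is a placeholder -- at the time the bundled copy shipped as sklearn.externals.joblib, newer setups install joblib as its own package):

```python
import joblib  # older scikit-learn: from sklearn.externals import joblib

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
clf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# compress takes a level from 0-9: higher levels trade CPU time (and
# extra memory while compressing) for a smaller file on disk.
joblib.dump(clf, "model.joblib", compress=3)
clf_loaded = joblib.load("model.joblib")
```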

HTH,
 Peter


2014-02-26 17:37 GMT+01:00 Lorenzo Isella <lorenzo.ise...@gmail.com>:

> Dear All,
> I am using RandomForest on a data set which has less than 20 features, but
> about 400000 lines.
> The point is that, even if I work on  a subset of about 30000 lines to
> train my model, when I save it using pickle, I get a large file in the
> order of several hundreds of Mb of space (see the snippet at the end of
> the email).
> I can then later load the model by doing the following
>
> In [8]: pkl_file = open("rf_wallmart_holidays.txt", "rb")
>
> In [9]: clf = pickle.load(pkl_file)
>
> In [10]: pkl_file.close()
>
> However, I am concerned that when I use the whole dataset, I will get a
> model size on the order of several GB, and I wonder whether I will be
> able to load it via pickle as I do above.
> I am just wondering if I am making any gross mistake (I have never used
> pickle in the past).
> Any suggestions about efficient ways to store/read the models developed
> with sklearn are appreciated.
> Regards
>
> Lorenzo
>
>
>
> ################################################################################
>
>
> clf = RandomForestRegressor(n_estimators=150,
>                             # compute_importances=True,
>                             n_jobs=2, verbose=3)
>
> sales=train.Weekly_Sales
>
> my_cols = set(train.columns)
>
> my_cols.remove("Weekly_Sales")
>
> my_cols = list(my_cols)
>
> clf.fit(train[my_cols], sales)
>
>
>
> with open('rf_wallmart_non_holidays.txt', 'wb') as f:
>     pickle.dump(clf, f)
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>



-- 
Peter Prettenhofer
