Dear All,
I am using a RandomForest on a data set that has fewer than 20 features but
about 400000 rows.
The point is that even if I train the model on a subset of only about 30000
rows, saving it with pickle produces a file of several hundred MB (see the
snippet at the end of the email).
I can later load the model as follows (opening the file in binary mode,
since it was written with 'wb'):

In [8]: pkl_file = open("rf_wallmart_holidays.txt", "rb")

In [9]: clf = pickle.load(pkl_file)

In [10]: pkl_file.close()

However, I am concerned that when I use the whole dataset the model will be
on the order of several GB, and I wonder whether I will still be able to
load it via pickle as I do above.
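
If it does get that big, I was thinking of switching to joblib, since I
understand it can memory-map the numpy arrays inside the trees at load time
instead of reading everything into RAM. A minimal sketch of what I have in
mind (the file name is just a placeholder, and I believe memory-mapping only
works for models dumped without compression):

from sklearn.externals import joblib

# Memory-map the stored numpy arrays instead of loading them into RAM
# up front; this only works if the model was dumped uncompressed.
clf = joblib.load('rf_wallmart_holidays.joblib', mmap_mode='r')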
I am just wondering whether I am making any gross mistake (I have never used
pickle before).
Any suggestions about efficient ways to store/read models developed with
sklearn would be appreciated.
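
For what it's worth, here are the two things I was planning to try next
(a sketch only; the file names are placeholders): dumping with the highest
binary pickle protocol rather than the default one (which on Python 2 is the
verbose ASCII protocol 0), and joblib.dump with compression, which I
understand handles the large numpy arrays in the trees more efficiently:

import pickle
from sklearn.externals import joblib

# Binary pickle protocol (-1 = highest available) is far more compact
# than the default protocol used in the snippet below.
with open('rf_wallmart_non_holidays.pkl', 'wb') as f:
    pickle.dump(clf, f, protocol=-1)

# joblib can also compress on the fly (compress takes a level from 0 to 9).
joblib.dump(clf, 'rf_wallmart_non_holidays.joblib', compress=3)
clf = joblib.load('rf_wallmart_non_holidays.joblib')

Would that be the right direction?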
Regards

Lorenzo


################################################################################


import pickle

from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=150,
                            # compute_importances=True,
                            n_jobs=2, verbose=3)

# `train` is a pandas DataFrame loaded earlier.
# Target: weekly sales; features: every other column.
sales = train.Weekly_Sales

my_cols = set(train.columns)
my_cols.remove("Weekly_Sales")
my_cols = list(my_cols)

clf.fit(train[my_cols], sales)

# Save the fitted model in binary mode; the with-block closes the file.
with open('rf_wallmart_non_holidays.txt', 'wb') as f:
    pickle.dump(clf, f)
