Greetings!

When pickling a fitted random forest, the storage requirements seem disproportionately large. The space usage appears to be dominated by the indices_ attribute on the individual decision trees in estimators_. What are these needed for? It seems that one can delete them, still make predictions, and save a lot of space. Sample code and output below.

Thanks!
-Mike

-----Sample Code-----
#!/usr/bin/env python
# c.f. http://scikit-learn-general.narkive.com/yJjAn9P2/pickled-random-forest-file-size-by-design
import sklearn.ensemble, pickle

N = 500000
toPredict = [[i % 6, i % 7, i % 8] for i in range(1000)]

# Fit a forest on a synthetic dataset and measure its pickled size.
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=128)
clf.fit(X=[[i % 6, i % 7, i % 8] for i in range(N)],
        y=[i % 5 > 0 for i in range(N)])

size1 = len(pickle.dumps(clf))
print("size1 = " + str(size1))
predict1 = clf.predict(toPredict)

# Delete the indices_ attribute from every tree and measure again.
for x in clf.estimators_:
    del x.indices_

size2 = len(pickle.dumps(clf))
print("size2 = " + str(size2))
predict2 = clf.predict(toPredict)

# Confirm that the predictions are unchanged.
tot = (predict1 != predict2).sum()
print("error = " + str(tot))

-----Sample Output-----
size1 = 67145826
size2 = 3137874
error = 0
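
For completeness, here is a rough sketch of how one could confirm that indices_ accounts for most of the size. It assumes the same clf fit above, a scikit-learn version whose trees still expose indices_, and that it is run after the fit but before the deletion loop; it simply pickles one tree and its indices_ attribute separately and compares the byte counts.

-----Size breakdown sketch-----
# Run after clf.fit(...) but before deleting indices_ from the trees.
tree = clf.estimators_[0]
tree_bytes = len(pickle.dumps(tree))
indices_bytes = len(pickle.dumps(tree.indices_))
print("tree total     = " + str(tree_bytes))
print("indices_ alone = " + str(indices_bytes))
print("fraction       = " + str(float(indices_bytes) / tree_bytes))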