Any help someone might be able to provide with this error would be greatly
appreciated. I was trying to familiarize myself with sklearn's decision tree
source code by playing around with some of the lower-level classes and
functions on a portion of the iris data set. I'm trying to build a single
tree using BestFirstTreeBuilder, with code borrowed from tree.py. The last
line of my code raises the error:
builder.build(tree_, x, y)
The full error is:
Exception MemoryError: MemoryError() in 'sklearn.tree._tree.Tree._resize' ignored
I've attached the full code I'm using (one page) and also pasted it below.
The sklearn version in use is the pip-installed release rather than the most
recent version on GitHub, and I'm running this on Windows.
Anyone recognize this error?
Best, and thanks so much!
Ken
import numpy as np

from sklearn import datasets
from sklearn.tree import _tree
from sklearn.tree._tree import BestFirstTreeBuilder, Tree
from sklearn.utils import check_random_state

iris = datasets.load_iris()
x = iris.data[:, :2]  # only the first two features
y = iris.target

# Lookup tables borrowed from tree.py
CRITERIA_CLF = {"gini": _tree.Gini, "entropy": _tree.Entropy}
CRITERIA_REG = {"mse": _tree.MSE, "friedman_mse": _tree.FriedmanMSE}
SPLITTERS = {"best": _tree.BestSplitter,
             "presort-best": _tree.PresortBestSplitter,
             "random": _tree.RandomSplitter}

criterion = "gini"
splitter = "best"
min_samples_split = 2
min_samples_leaf = 1
# min_weight_leaf = 0
max_depth = (2 ** 31) - 1
max_leaf_nodes = (2 ** 31) - 1
random_state = check_random_state(0)

y = np.reshape(y, (-1, 1))
n_samples, n_features_ = x.shape
max_features_ = n_features_
n_outputs_ = y.shape[1]
n_classes_ = np.array([1] * n_outputs_, dtype=np.intp)

criterion = CRITERIA_CLF[criterion](n_outputs_, n_classes_)
splitter = SPLITTERS[splitter](criterion, max_features_, min_samples_leaf,
                               random_state)

# Note: the current pip-installed version of sklearn omits min_weight_leaf
builder = BestFirstTreeBuilder(splitter, min_samples_split,
                               min_samples_leaf,
                               # min_weight_leaf,
                               max_depth,
                               max_leaf_nodes)

tree_ = Tree(n_features_, n_classes_, n_outputs_)
builder.build(tree_, x, y)  # <-- raises the MemoryError
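
For reference, here is a minimal sketch of how I would grow a comparable
best-first tree through the public API instead (max_leaf_nodes=8 is just an
arbitrary small cap; as far as I can tell from tree.py, passing max_leaf_nodes
is what routes fit() to BestFirstTreeBuilder, and fit() also converts X to
float32 and derives the real per-output class counts before calling the
builder):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion="gini",
                             min_samples_split=2,
                             min_samples_leaf=1,
                             max_leaf_nodes=8,  # arbitrary small cap
                             random_state=0)
clf.fit(x, y.ravel())  # same iris subset as above
print(clf.tree_.node_count)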
On Sat, Sep 20, 2014 at 10:04 AM, Mathieu Blondel <[email protected]>
wrote:
> I read the RGF paper. Interesting method but definitely too early to
> include it in scikit-learn (we focus on mature algorithms). This also looks
> more complicated to implement than gradient boosting, since tree induction
> and boosting are interleaved.
>
> The paper also clarified what "fully corrective" means, thanks. To
> summarize, here are different strategies for setting the weights in
> gradient boosting:
> 1. using the same constant value (`learning_rate`) for all estimators
> 2. setting the weight of the last estimator by line search (other weights
> are kept fixed)
> 3. setting one separate weight per leaf node of the last estimator, by
> line search
> 4. refitting all estimator weights (including the past ones)
> 5. refitting all leaf nodes of all estimators?
>
> Some authors [1] argued that option 1 is sufficient in practice to get
> good performance since our goal is to do well on test data, not training
> data.
> Option 3 is what scikit-learn implements. Option 4 is the fully corrective
> case. It basically amounts to least squares for the square loss or to
> logistic regression for the log loss (using each estimator's predictions as
> features). One thing, though, is that our implementation doesn't store the
> estimator weights explicitly (leaves are updated directly). Setting a
> penalty (l1 or l2) on the estimator weights (i.e., on the functional)
> should prevent overfitting. Option 5 sounds pretty computationally
> expensive, although the update doesn't need to be done at every iteration.
>
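> To make option 4 concrete for the square loss, here is a rough sketch with
> an l2 penalty on the estimator weights (the names and the helper below are
> only for illustration, not actual scikit-learn or ivalice API):
>
> import numpy as np
>
> def refit_estimator_weights(stage_preds, y, l2=1e-3):
>     # stage_preds: array of shape (n_samples, n_estimators); column j
>     # holds the training-set predictions of estimator j. y: (n_samples,).
>     P = np.asarray(stage_preds)
>     # Penalized least squares on the weights of the functional:
>     # solve (P^T P + l2 * I) w = P^T y.
>     A = P.T.dot(P) + l2 * np.eye(P.shape[1])
>     return np.linalg.solve(A, P.T.dot(y))
>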
> I recently re-implemented gradient boosting [2]. One difference in my
> implementation is that it is possible to use any base learner (not just
> trees). So my implementation stores estimator weights explicitly and uses
> option 2 above. The fully corrective updates (option 4) might be easier to
> add to my implementation.
>
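> And for completeness, option 2 for the square loss reduces to a closed-form
> "line search" (again just an illustrative helper, not the actual ivalice
> code; y, current_pred and new_pred are 1-d NumPy arrays):
>
> def last_estimator_weight(y, current_pred, new_pred):
>     # w* = argmin_w ||y - (current_pred + w * new_pred)||^2
>     residual = y - current_pred
>     return residual.dot(new_pred) / new_pred.dot(new_pred)
>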
> BTW, the notion of fully corrective updates is also present in the
> matching pursuit (-> orthogonal matching pursuit) and Frank-Wolfe
> literatures.
>
> Mathieu
>
> [1] "Boosting Algorithms: Regularization, Prediction and Model Fitting",
> Peter Bühlmann and Torsten Hothorn (thanks to Peter for telling me about
> this paper)
>
> [2]
> https://github.com/mblondel/ivalice/blob/master/ivalice/impl/gradient_boosting.py
>
> Mathieu
>
> On Wed, Sep 17, 2014 at 4:02 AM, c TAKES <[email protected]> wrote:
>
>> Yes - in fact my real goal is ultimately to implement RGF, though I had
>> considered building/forking off the current GradientBoostingRegressor
>> code as a starting point, (a) because I'm new to developing for
>> scikit-learn and (b) to maintain as much consistency as possible with the
>> rest of the package.
>>
>> Upon further review though, I don't think there's much point in adding
>> fully corrective updates to GBR because without the regularization (the
>> rest of RGF) it is probably useless and leads to overfitting, as per
>> http://stat.rutgers.edu/home/tzhang/software/rgf/tpami14-rgf.pdf.
>>
>> So it would likely be more useful to go ahead and create RGF as an
>> entirely separate module. But that could take some time, of course.
>>
>> Is anyone working on RGF for sklearn?
>>
>> Arnaud, thanks for directing me to your sparse implementation, I will
>> have a look! It would certainly be excellent to have this work for all
>> decision tree algorithms.
>>
>> Ken
>>
>> On Tue, Sep 16, 2014 at 11:16 AM, Peter Prettenhofer <
>> [email protected]> wrote:
>>
>>> The only reference I know is the Regularized Greedy Forest paper by
>>> Johnson and Zhang [1]. I haven't read the primary source (by Zhang as
>>> well).
>>>
>>> [1] http://arxiv.org/abs/1109.0887
>>>
>>> 2014-09-16 15:15 GMT+02:00 Mathieu Blondel <[email protected]>:
>>>
>>>> Could you give a reference for gradient boosting with fully corrective
>>>> updates?
>>>>
>>>> Since the philosophy of gradient boosting is to fit each tree against
>>>> the residuals (or negative gradient) so far, I am wondering how such a
>>>> fully corrective update would work...
>>>>
>>>> Mathieu
>>>>
>>>> On Tue, Sep 16, 2014 at 9:16 AM, c TAKES <[email protected]> wrote:
>>>>
>>>>> Is anyone working on making Gradient Boosting Regressor work with
>>>>> sparse matrices?
>>>>>
>>>>> Or is anyone working on adding an option for fully corrective gradient
>>>>> boosting, i.e. all trees in the ensemble are re-weighted at each
>>>>> iteration?
>>>>>
>>>>> These are things I would like to see and may be able to help with if
>>>>> no one is currently working on them.
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Peter Prettenhofer
>>>
>>>
>>>
>>
>