Hi everybody,

I'm getting a segmentation fault when trying to fit a big dataset. Fitting
10 million (or fewer) 200-dimensional feature vectors is fine, but 12 million
(or more) makes it crash. Any idea why that would happen? I didn't see the
computer run short on memory during the process, although I wasn't able to
check just before the segfault.
The following code snippet reproduces the error (my computer is an Intel
Xeon X5672 with 128 GB of RAM):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_feats = 12000000    # full dataset size (fit crashes)
n_subset = 10000000   # subset size (fit succeeds)
feat_dims = 200
gt_dims = 3
# Random binary labels, gt_dims outputs per sample
y = np.random.randint(low=0, high=2, size=(n_feats, gt_dims)).astype(np.uint8)
# Features: a random linear map of the labels plus small Gaussian noise
X = (y.dot(np.random.randn(gt_dims, feat_dims)) +
     1e-2 * np.random.randn(n_feats, feat_dims)).astype(np.float32)

print('generated training')
clf = RandomForestClassifier(n_estimators=3, min_samples_leaf=1)
clf.fit(X[:n_subset], y[:n_subset])
print('This should be reached')
clf.fit(X, y)
print('Segmentation fault before this')
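
For reference, here is a rough estimate of how much memory the arrays in the snippet above occupy. It is only back-of-the-envelope arithmetic: the helper `array_bytes` and the `ITEMSIZE` table are made up for this illustration, and the sketch does not account for whatever the forest allocates internally during `fit`.

```python
# Back-of-the-envelope sizes for the arrays created in the snippet.
# ITEMSIZE and array_bytes are illustrative helpers, not part of any library.
ITEMSIZE = {"uint8": 1, "float32": 4, "float64": 8}  # bytes per element


def array_bytes(shape, dtype):
    """Size in bytes of a dense array with the given shape and dtype name."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * ITEMSIZE[dtype]


n_feats, feat_dims, gt_dims = 12000000, 200, 3

x_bytes = array_bytes((n_feats, feat_dims), "float32")    # final X
y_bytes = array_bytes((n_feats, gt_dims), "uint8")        # final y
# Each float64 temporary built while generating X (before .astype(np.float32))
tmp_bytes = array_bytes((n_feats, feat_dims), "float64")

print("X:   %.2f GB" % (x_bytes / 1e9))
print("y:   %.3f GB" % (y_bytes / 1e9))
print("one float64 temporary: %.2f GB" % (tmp_bytes / 1e9))
```

Under these assumptions X and y together stay far below 128 GB, which is consistent with not seeing the machine run out of memory.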

PS: The stack after the segmentation fault looks as follows:

#0  __pyx_f_7sklearn_4tree_5_tree_4Tree_find_best_split
(__pyx_v_self=<optimized out>, __pyx_v_X_ptr=0x7ffb5f07d010,
__pyx_v_X_stride=12000000, __pyx_v_X_argsorted_ptr=0x7ff9221c4010,
__pyx_v_X_argsorted_stride=12000000,
    __pyx_v_y_ptr=0x7ff8f7e8e010, __pyx_v_y_stride=3,
__pyx_v_sample_weight_ptr=0x7fffe507c010, __pyx_v_sample_mask_ptr=0x1876810
"\001", __pyx_v_n_node_samples=7586194,
__pyx_v_weighted_n_node_samples=-0.14411123842000961,
    __pyx_v_n_total_samples=12000000, __pyx_v__best_i=0x7fffffffcb9c,
__pyx_v__best_t=0x7fffffffcb68, __pyx_v__best_error=0x7fffffffcb70,
__pyx_v__initial_error=0x7fffffffcb78) at sklearn/tree/_tree.c:6708
#1  0x00007fffeae45cb9 in
__pyx_f_7sklearn_4tree_5_tree_4Tree_recursive_partition
(__pyx_v_self=0x17a33b0, __pyx_v_X=0x15f7b50,
__pyx_v_X_argsorted=0x1646e00, __pyx_v_y=0x14be710,
__pyx_v_sample_weight=0x16e0660,
    __pyx_v_sample_mask=0x1609760, __pyx_v_n_node_samples=7586194,
__pyx_v_weighted_n_node_samples=12000000, __pyx_v_depth=0,
__pyx_v_parent=-1, __pyx_v_is_left_child=0, __pyx_v_buffer_value=0x160a750)
at sklearn/tree/_tree.c:5237
#2  0x00007fffeae4d4b7 in __pyx_f_7sklearn_4tree_5_tree_4Tree_build
(__pyx_v_self=0x17a33b0, __pyx_v_X=0x15f7b50, __pyx_v_y=0x14be710,
__pyx_skip_dispatch=<optimized out>, __pyx_optional_args=<optimized out>)
at sklearn/tree/_tree.c:4639
#3  0x00007fffeae38381 in __pyx_pf_7sklearn_4tree_5_tree_4Tree_10build
(__pyx_v_sample_weight=0x16e0660, __pyx_v_X_argsorted=0x1646e00,
__pyx_v_sample_mask=0x1609760, __pyx_v_y=0x14be710, __pyx_v_X=0x15f7b50,
__pyx_v_self=0x17a33b0)
    at sklearn/tree/_tree.c:4826
#4  __pyx_pw_7sklearn_4tree_5_tree_4Tree_11build (__pyx_v_self=0x17a33b0,
__pyx_args=0x1820368, __pyx_kwds=<optimized out>) at
sklearn/tree/_tree.c:4795
#5  0x000000000049d585 in PyEval_EvalFrameEx ()
#6  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#7  0x00000000004983b8 in PyEval_EvalFrameEx ()
#8  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#9  0x00000000004a8a92 in ?? ()
#10 0x00000000004e9f36 in PyObject_Call ()
#11 0x0000000000499bc0 in PyEval_EvalFrameEx ()
#12 0x000000000049f1c0 in PyEval_EvalCodeEx ()
#13 0x00000000004a8960 in ?? ()
#14 0x00000000004e9f36 in PyObject_Call ()
#15 0x00000000004ec11a in ?? ()
#16 0x00000000004e9f36 in PyObject_Call ()
#17 0x00000000004eb39e in ?? ()
#18 0x00000000004db6a6 in ?? ()
#19 0x00000000004e9f36 in PyObject_Call ()
#20 0x000000000049846a in PyEval_EvalFrameEx ()
#21 0x0000000000498602 in PyEval_EvalFrameEx ()
#22 0x000000000049f1c0 in PyEval_EvalCodeEx ()
#23 0x00000000004a8960 in ?? ()
#24 0x00000000004e9f36 in PyObject_Call ()
#25 0x00000000004ec11a in ?? ()
#26 0x00000000004e9f36 in PyObject_Call ()
#27 0x00000000004eb62e in ?? ()
#28 0x00000000004e9f36 in PyObject_Call ()
#29 0x000000000049846a in PyEval_EvalFrameEx ()
#30 0x000000000049f1c0 in PyEval_EvalCodeEx ()
#31 0x00000000004983b8 in PyEval_EvalFrameEx ()
#32 0x000000000049f1c0 in PyEval_EvalCodeEx ()
#33 0x00000000004a9081 in PyRun_FileExFlags ()
#34 0x00000000004a9311 in PyRun_SimpleFileExFlags ()
#35 0x00000000004aa8bd in Py_Main ()
#36 0x00007ffff68e176d in __libc_start_main () from
/lib/x86_64-linux-gnu/libc.so.6
#37 0x000000000041b9b1 in _start ()
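
A possible lead, offered as a guess rather than a diagnosis: frame #0 reports X_stride=12000000 and X_argsorted_stride=12000000, which suggests a column-major layout with strides counted in elements. If some offset into those arrays were held in a signed 32-bit integer (an assumption, not verified against the sklearn source), it would overflow somewhere between 10 and 12 million rows at 200 columns. The sketch below only does that arithmetic; `max_flat_offset` and `fits_in_int32` are hypothetical helper names.

```python
# Check whether the largest element offset into a column-major
# (n_samples, n_features) array still fits in a signed 32-bit integer.
# Assumption: offsets are computed as feature_index * n_samples + sample_index.
INT32_MAX = 2**31 - 1  # 2147483647


def max_flat_offset(n_samples, n_features):
    """Offset of the last element in a column-major array, in elements."""
    return (n_features - 1) * n_samples + (n_samples - 1)


def fits_in_int32(n_samples, n_features):
    """True if every element offset is representable as a signed 32-bit int."""
    return max_flat_offset(n_samples, n_features) <= INT32_MAX


for n_samples in (10000000, 12000000):
    print(n_samples, max_flat_offset(n_samples, 200), fits_in_int32(n_samples, 200))
```

With 200 columns the largest offset is 1,999,999,999 for 10 million rows, which fits, and 2,399,999,999 for 12 million rows, which does not. That lines up with the sizes that work and crash here, but it does not prove that this is the cause.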
-- Javier Romero --
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
