Dear All,
After my previous question on how to use GridSearchCV with
boosting (thanks Peter and Andy!), here is another related one.
I'd like to keep a detailed log of the results of each fold
of each assignment to the parameters during model selection
via GridSearchCV. Peter suggested redefining the .score()
method of the regressor of interest, so I am going in
that direction. Note that I am heavily using the parallelization
capabilities of GridSearchCV, i.e. joblib, so each evaluation of
the regressor runs in a different process.
I know that sharing a global variable, e.g. a list in which to
put all the results of the classifier instances, does not work here
because each process has its own copy of that variable. In short,
global variables do not work with multiprocessing. My current attempt uses
sqlite + sqlalchemy + pickle, i.e. a very thin layer
by which the detailed results of each GridSearchCV step
are first pickled and then transparently mapped into a sqlite db
by sqlalchemy. SQLite can handle concurrent writes from different
processes to the same db, so this solution should work. But it does not...
Unfortunately I get this exception:
---
/tmp/python-15120YlP.py in <module>()
     56 clf = GridSearchCV(GradientBoostingRegressor(loss='ls', random_state=seed), param_grid=parameters, loss_func=None, n_jobs=-1, cv=n_folds, verbose=10)
     57
---> 58 clf.fit(X, y)
     59

/usr/lib/pymodules/python2.7/sklearn/grid_search.pyc in fit(self, X, y, **params)
    396 X, y, base_clf, clf_params, train, test, self.loss_func,
    397 self.score_func, self.verbose, **self.fit_params)
--> 398 for clf_params in grid for train, test in cv)
    399
    400 # Out is a list of triplet: score, estimator, n_test_samples

/usr/lib/pymodules/python2.7/joblib/parallel.pyc in __call__(self, iterable)
    473 self.dispatch(function, args, kwargs)
    474
--> 475 self.retrieve()
    476 # Make sure that we get a last message telling us we are done
    477 elapsed_time = time.time() - self._start_time

/usr/lib/pymodules/python2.7/joblib/parallel.pyc in retrieve(self)
    425 # Convert this to a JoblibException
    426 exception_type = _mk_exception(exception.etype)[0]
--> 427 raise exception_type(report)
    428 raise exception
    429

JoblibUnmappedInstanceError: JoblibUnmappedInstanceError
___________________________________________________________________________
Class '__builtin__.unicode' is not mapped
___________________________________________________________________________
-----------
which I do not understand. The issue is triggered by "session.add(result)",
which lies within the overridden score function of the regressor,
as suggested by Peter in the previous thread.
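
To make the setup more concrete, here is a minimal sketch of the kind of
logging layer described above. It is not the code that produced the
traceback: it assumes the standard-library sqlite3 module in place of
sqlalchemy, and the table name fold_results and the functions init_log,
log_result and read_log are hypothetical names chosen for illustration.
The idea is that the overridden score function would call log_result
from inside each worker process.

```python
import pickle
import sqlite3


def init_log(db_path):
    """Create the results table once, before any worker starts."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS fold_results "
            "(id INTEGER PRIMARY KEY, params BLOB, score REAL)"
        )
    conn.close()


def log_result(db_path, params, score):
    """Append one fold's result; meant to be called inside a worker process."""
    # Each call opens its own connection, so nothing is shared across processes.
    # timeout makes a writer wait (up to 30 s here) if the db is locked.
    conn = sqlite3.connect(db_path, timeout=30.0)
    with conn:
        conn.execute(
            "INSERT INTO fold_results (params, score) VALUES (?, ?)",
            (pickle.dumps(params), float(score)),
        )
    conn.close()


def read_log(db_path):
    """Return all logged (params, score) pairs in insertion order."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT params, score FROM fold_results ORDER BY id"
    ).fetchall()
    conn.close()
    return [(pickle.loads(blob), score) for blob, score in rows]
```

Opening a fresh connection per call keeps the sketch simple; whether that
is fast enough for a large grid is something I have not measured.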
I understand all this may sound confusing and poorly explained.
I am working on a minimal example to expose the issue. In the meantime,
I am sending this first brief attempt to capture your interest in
this problem. Maybe some of you will immediately see the issue without
further explanation. If you spot it, could you please explain?
BTW, is there a preferred way to communicate between the subprocesses
and the parent process within sklearn?
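
For comparison, outside sklearn the pattern I know from plain
multiprocessing is to have each worker return its detailed record and let
the parent collect them, instead of sharing state. The sketch below only
illustrates that pattern and is not sklearn or joblib API: evaluate_fold
and run_grid are hypothetical names, the score is a fake value so that the
example runs without scikit-learn, and it assumes a POSIX system where the
"fork" start method is available.

```python
import multiprocessing as mp


def evaluate_fold(task):
    """Hypothetical stand-in for fitting and scoring one (params, fold) pair."""
    params, fold = task
    # Fake score so the sketch runs without scikit-learn.
    score = params["learning_rate"] * (fold + 1)
    # The returned record is pickled and sent back to the parent process.
    return {"params": params, "fold": fold, "score": score}


def run_grid(grid, n_folds, n_jobs=2):
    """Evaluate every (params, fold) pair in workers; collect in the parent."""
    tasks = [(params, fold) for params in grid for fold in range(n_folds)]
    # Assumption: POSIX only, since the "fork" start method is requested.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=n_jobs) as pool:
        # map() returns the records in the same order as the tasks.
        return pool.map(evaluate_fold, tasks)


if __name__ == "__main__":
    grid = [{"learning_rate": 0.1}, {"learning_rate": 0.5}]
    for record in run_grid(grid, n_folds=3):
        print(record)
```

With this approach only the parent would write the log, so no concurrent
access to a file or db is needed.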
Best,
Emanuele
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general