Re: [Scikit-learn-general] GridSearchCV/joblib and concurrent writing

Emanuele Olivetti Fri, 22 Jun 2012 10:11:27 -0700

Hi Peter,

(partial) answers below inline


On 06/22/2012 06:29 PM, Peter Prettenhofer wrote:
> Hi Emanuele,
>
> I cannot make any sense of the stack trace - have you tried to run
> GridSearchCV with n_jobs=1? If so, does it work OK?

With n_jobs=1 I get a similar, but not identical, issue. I am digging
into it because it is also related to sqlalchemy...

>
> When do you write the parameters into the DB - at the end of ``score``?

I write the parameters just after computing them in custom_score()
and before setting the model to the best iteration.
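
A minimal sketch of this kind of per-call logging, using only the
standard-library sqlite3 module rather than the sqlalchemy layer discussed
in this thread. The names DB_PATH, init_db, log_result and read_results are
hypothetical and not taken from the actual code; the sketch assumes each
process opens its own connection every time it writes:

```python
import pickle
import sqlite3

DB_PATH = "grid_search_log.db"  # hypothetical file name


def init_db(path=DB_PATH):
    """Create the log table once, in the parent process, before fitting."""
    conn = sqlite3.connect(path)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results (params BLOB, score REAL)"
        )
    conn.close()


def log_result(params, score, path=DB_PATH):
    """Open a fresh connection in the calling process and insert one row."""
    conn = sqlite3.connect(path, timeout=30)  # wait up to 30 s on a locked db
    with conn:
        conn.execute(
            "INSERT INTO results (params, score) VALUES (?, ?)",
            (sqlite3.Binary(pickle.dumps(params)), float(score)),
        )
    conn.close()


def read_results(path=DB_PATH):
    """Return all logged (params, score) pairs in insertion order."""
    conn = sqlite3.connect(path)
    rows = conn.execute(
        "SELECT params, score FROM results ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [(pickle.loads(bytes(p)), s) for p, s in rows]
```

log_result() would be called from inside the custom scoring function, and
read_results() from the parent once the fit has finished. Opening the
connection inside the call means no connection object is ever shared
between processes.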

>
> It would be great if you could provide a gist which exposes the problem.

Working on it.

>
> BTW: it's strange that grid_search.py is using your "global" joblib
> (/usr/lib/pymodules/python2.7/joblib/) and not the one that comes with
> sklearn (in sklearn/externals/joblib)

NeuroDebian / Ubuntu 12.04 ships the package python-sklearn (v0.11)
without the internal copy of joblib. So there is just the global joblib.
Yarik's decision? ;-)


Thanks,

Emanuele





> best,
>   Peter
>
>
> 2012/6/22 Emanuele Olivetti<[email protected]>:
>> Dear All,
>>
>> After my previous question on how to use GridSearchCV with
>> boosting (thanks Peter and Andy!), here is another related one.
>>
>> I'd like to keep a detailed log of the results of each fold
>> for each parameter assignment during model selection
>> via GridSearchCV. Peter suggested redefining the .score()
>> method of the regressor of interest, so I am going in
>> that direction. Note that I am heavily using the parallelization
>> capabilities of GridSearchCV, i.e. joblib, so each evaluation of
>> the regressor runs in a different process.
>>
>> I know that sharing a global variable, e.g. a list, in which to
>> collect the results of all the classifier instances does not work here
>> because each process has its own copy of that variable - in short,
>> global variables do not work across processes. My current attempt is with
>> sqlite + sqlalchemy + pickle, i.e. writing a very thin layer
>> by which the detailed results of each GridSearchCV step
>> are first pickled and then transparently mapped into a sqlite db thanks
>> to sqlalchemy. SQLite can handle concurrent writes from different
>> processes to the same db... so this solution should work. But it does not...
>>
>> Unfortunately I get this exception:
>> ---
>> /tmp/python-15120YlP.py in <module>()
>>       56     clf = GridSearchCV(GradientBoostingRegressor(loss='ls', random_state=seed), param_grid=parameters, loss_func=None, n_jobs=-1, cv=n_folds, verbose=10)
>>       57
>> --->  58     clf.fit(X, y)
>>       59
>>
>> /usr/lib/pymodules/python2.7/sklearn/grid_search.pyc in fit(self, X, y, **params)
>>      396                 X, y, base_clf, clf_params, train, test, self.loss_func,
>>      397                 self.score_func, self.verbose, **self.fit_params)
>> -->  398                     for clf_params in grid for train, test in cv)
>>      399
>>      400         # Out is a list of triplet: score, estimator, n_test_samples
>>
>> /usr/lib/pymodules/python2.7/joblib/parallel.pyc in __call__(self, iterable)
>>      473                 self.dispatch(function, args, kwargs)
>>      474
>> -->  475             self.retrieve()
>>      476             # Make sure that we get a last message telling us we are done
>>      477             elapsed_time = time.time() - self._start_time
>>
>> /usr/lib/pymodules/python2.7/joblib/parallel.pyc in retrieve(self)
>>      425                     # Convert this to a JoblibException
>>      426                     exception_type = _mk_exception(exception.etype)[0]
>> -->  427                     raise exception_type(report)
>>      428                 raise exception
>>      429
>>
>> JoblibUnmappedInstanceError: JoblibUnmappedInstanceError
>> ___________________________________________________________________________
>> Class '__builtin__.unicode' is not mapped
>> ___________________________________________________________________________
>> -----------
>> which I do not understand. The issue is triggered by "session.add(result)",
>> which lies within the overridden score function of the regressor,
>> as suggested by Peter in the previous thread.
>>
>> I understand all this could sound confusing and not well explained.
>> I am working on a minimal example to expose the issue. In the meantime
>> I am making this first brief attempt to capture your interest in
>> this problem. Maybe some of you will immediately see the issue without
>> further explanation. If you spot it, could you please explain?
>>
>> BTW, is there a preferred way to communicate between the subprocesses
>> and the parent process within sklearn?
>>
>> Best,
>>
>> Emanuele
>>
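
The point about global variables and the question about talking to the
parent process can be illustrated with a small sketch. It uses the
standard-library multiprocessing module as a stand-in for joblib, and
evaluate() and run() are hypothetical names for one fit/score step and its
driver. It shows that a module-level list filled in by the workers stays
empty in the parent, while values returned by the worker function do come
back:

```python
import multiprocessing

results = []  # module-level list; every process gets its own copy


def evaluate(param):
    """Hypothetical stand-in for one fit/score step."""
    score = param * 2
    results.append((param, score))  # only changes the worker's own copy
    return param, score             # return values travel back to the parent


def run(params):
    """Evaluate all params in worker processes and collect the returns."""
    with multiprocessing.Pool(processes=2) as pool:
        return pool.map(evaluate, params)


if __name__ == "__main__":
    collected = run([1, 2, 3])
    print(results)    # [] : the parent's list was never touched
    print(collected)  # [(1, 2), (2, 4), (3, 6)]
```

Returning whatever needs to be logged from the worker function and
collecting it in the parent avoids any shared state. This matches the
traceback quoted in this message, where grid_search.py comments that the
output of the parallel call is a list of (score, estimator, n_test_samples)
triplets.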


