Thanks to all for the tips on GridSearch with FeatureUnion, I'll be trying
those out today. And @amueller I've been following the development of your
PR for the random sampling of param space with great interest.

But back to the initial problem: it seems that an empty input is the
cause. My raw data is in a CSV file that looks like

"class_i", "input_0"
"class_j", "input_1"
[...]

(inputs are utf-8 strings)

It turns out there's at least one empty input:

"class_k", ""

For some reason, that empty input makes the grid search fail when it runs
in parallel. I confirmed this by narrowing the data down to a small file
(100 inputs) and running it with and without that one input. With n_jobs=1
it ran fine in both cases, but with n_jobs=-1 it only ran once that input
was removed. Once again, the error being thrown looked like:

Process PoolWorker-1:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 59, in worker
    task = get()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/queues.py", line 352, in get
    return recv()
TypeError: ('data type not understood', <type 'numpy.dtype'>, ('S0', 0, 1))
Process PoolWorker-2:
[...]
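
The traceback above fails in recv(), which is where a worker process
unpickles the task it was sent. A rough way to check whether a given object
survives that step, without starting any workers, is to pickle and unpickle
it in the current process. This is only a sketch: the helper name
survives_pickle_roundtrip is made up for illustration, and I'm assuming
(not asserting) that the failure would also show up in a single-process
round trip.

```python
import pickle


def survives_pickle_roundtrip(obj):
    """Return (True, None) if obj can be pickled and unpickled again,
    otherwise (False, the exception that was raised).

    Hypothetical helper, not part of scikit-learn or multiprocessing.
    """
    try:
        pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    except Exception as exc:  # report any failure instead of crashing
        return False, exc
    return True, None


# Example: a plain list of strings, one of them empty.
ok, err = survives_pickle_roundtrip(["input_0", ""])
print(ok, err)
```

Running the same check on the arrays that actually get handed to the grid
search (with and without the empty input) might show whether the transfer
to the workers is what breaks.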

Maybe someone who knows the internals of GridSearchCV can shed some light
on what's happening here? If anything, I think I'd prefer this to fail in
both cases, or at least throw up an "empty input detected" warning of some
sort.
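
In the meantime, a workaround on the loading side could look something like
the sketch below: read the CSV, warn about rows with an empty input, and
skip them before anything reaches the grid search. The function name
load_labeled_rows and the two-column layout are assumptions based on the
sample rows above, and the code is written for Python 3 rather than the
2.7 shown in the traceback.

```python
import csv
import io
import warnings


def load_labeled_rows(csv_file):
    """Read (label, text) rows from an open CSV file object.

    Rows whose text is empty or whitespace-only are skipped, and a single
    warning lists their row numbers. Hypothetical helper for illustration.
    """
    labels, texts, empty_rows = [], [], []
    # skipinitialspace handles the space after the comma in '"a", "b"'
    reader = csv.reader(csv_file, skipinitialspace=True)
    for row_number, row in enumerate(reader, start=1):
        if len(row) < 2:
            continue  # blank or malformed line, nothing to classify
        label, text = row[0], row[1]
        if not text.strip():
            empty_rows.append(row_number)
            continue
        labels.append(label)
        texts.append(text)
    if empty_rows:
        warnings.warn("empty input detected on row(s): %s" % empty_rows)
    return labels, texts


# Example with an in-memory file shaped like the data described above.
sample = io.StringIO('"class_i", "input_0"\n"class_k", ""\n"class_j", "input_1"\n')
labels, texts = load_labeled_rows(sample)
print(labels, texts)
```

Whether to skip, raise, or substitute a placeholder for the empty rows is a
judgment call; skipping with a warning just keeps the behaviour identical
for n_jobs=1 and n_jobs=-1.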



On 15 November 2012 23:26, Andreas Mueller <[email protected]> wrote:

> Sorry for not being able to help you with the actual problem, but
> another hint:
> I have a pull request for randomly sampling the parameter space, which
> should be much more efficient in a model with so many parameters.
>
> https://github.com/scikit-learn/scikit-learn/pull/1194
>
>
> ------------------------------------------------------------------------------
> Monitor your physical, virtual and cloud infrastructure from a single
> web console. Get in-depth insight into apps, servers, databases, vmware,
> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
> Pricing starts from $795 for 25 servers or applications!
> http://p.sf.net/sfu/zoho_dev2dev_nov
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>