2013/10/3 Josh Wasserstein <[email protected]>:
> Hello,
>
> I work in a classification problem where each instance has several
> attributes (e.g. the age of an individual). However, collecting instances
> (either labeled or unlabeled) is very expensive, since it requires asking
> domain experts to spend a significant amount of time to simply collect the
> instance (labeling the instance once it has been collected is actually
> relatively fast)
>
> Given this, I want to explore an active learning strategy where rather than
> starting with a set of labeled and unlabeled instances, I only have labeled
> instances, but I can ask for additional labeled instances by specifying:
>
> Attributes or statistics of the attributes of the additional instances (e.g.
> give me an instance with an age in the range [a,b]) on the new instances
> The desired label of the additional instances (e.g. give me a new instance
> with label x), or alternatively the label sampling distribution that the
> experts should use get new instances.
>
> With this, my questions are:
>
> Does this problem have a name? It looks like a specific case of Active
> Learning, but I am not sure, since in Active Learning one starts with a set
> of unlabeled instances, which is not my case.

Indeed this is a rather unusual variant of active learning. I don't know
whether it has a name.

> What types of approaches (from the most rudimentary to the more
> sophisticated) can I employ to identify the most informative sampling
> distribution from instance attributes or instance labels?
>
> Does scikit-learn provide any functionality geared towards the specific

Intuitively I would say it might be interesting to collect more samples
close to the decision boundaries of a model trained on the previously
collected instances.

Suppose you have a multiclass classification problem:

- select a pair of classes A and B for which the accuracy of the binary
  classifier is poor (estimated with cross-validation)

- train a binary classification model for this pair A, B

- locate the decision boundary and collect samples with label either
  A or B that are close to the boundary.

For instance, for a non-linear SVM model, you can try to collect new
samples with label B that are close to support vectors with label A,
"close" being defined in terms of the metric used by your kernel,
e.g. Euclidean for the RBF kernel.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
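
The pairwise procedure suggested in the reply above could be sketched
with scikit-learn roughly as follows. This is only an illustration under
stated assumptions: the data is synthetic (make_classification), the
helper names worst_class_pair and query_targets are hypothetical and not
part of scikit-learn, and cross_val_score is imported from
sklearn.model_selection as in recent releases (older releases exposed it
elsewhere).

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def worst_class_pair(X, y, cv=3):
    """Hypothetical helper: return (score, a, b) for the pair of classes
    whose binary RBF SVM has the lowest cross-validated accuracy."""
    results = []
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])  # keep only instances of classes a and b
        clf = SVC(kernel="rbf")
        score = cross_val_score(clf, X[mask], y[mask], cv=cv).mean()
        results.append((score, a, b))
    return min(results, key=lambda r: r[0])


def query_targets(X, y, a, b):
    """Hypothetical helper: fit a binary RBF SVM on classes a and b and
    return its support vectors, grouped by their class label."""
    mask = np.isin(y, [a, b])
    X_pair, y_pair = X[mask], y[mask]
    clf = SVC(kernel="rbf").fit(X_pair, y_pair)
    sv = clf.support_vectors_          # instances that define the boundary
    sv_labels = y_pair[clf.support_]   # label of each support vector
    return {a: sv[sv_labels == a], b: sv[sv_labels == b]}


# Synthetic stand-in for the already collected labeled instances (assumption).
X, y = make_classification(
    n_samples=150, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

score, a, b = worst_class_pair(X, y)
targets = query_targets(X, y, a, b)

print("hardest pair:", a, b, "cv accuracy: %.2f" % score)
print("support vectors with label", a, ":", len(targets[a]))
print("support vectors with label", b, ":", len(targets[b]))
```

With this in hand, one could ask the experts for new instances with
label b whose attributes lie near the rows of targets[a] (and the
reverse), where "near" is measured with the Euclidean distance to match
the RBF kernel. How to turn a support vector into an attribute range
that an expert can act on is left open here and depends on the domain.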
