Re: [Scikit-learn-general] Tackling Dataset bias

Peter Prettenhofer Fri, 16 Aug 2013 21:09:50 -0700

To check whether dataset shift has actually occurred, I usually do the
following: create a classification task that distinguishes between the two
datasets (e.g. the dataset from country A is the positive class, the dataset
from country B is the negative class). Then I compare the classification loss
(I usually use training loss) to the loss I get when I build the same task by
sampling both classes at random from dataset A. If the two datasets are much
easier to tell apart than two random samples of A, the distributions differ.
I think the basic idea is also described in the blog post by Alex Smola
that Olivier mentioned and in some papers by John Blitzer on domain
adaptation (look for something like Hellinger distance or A-distance).
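
The check described above can be sketched roughly as follows. This is only an
illustration in plain Python so that it runs without dependencies: the
synthetic 2-D data, the hand-rolled logistic regression, and the names
fit_logistic, domain_loss, data_a and data_b are all assumptions made for the
example. In practice a scikit-learn classifier such as LogisticRegression
would take the place of the toy model.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability that x belongs to the positive class."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Logistic regression by batch gradient descent (toy stand-in)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def log_loss(w, b, X, y):
    """Mean negative log-likelihood of the labels under the model."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(w, b, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def domain_loss(pos, neg):
    """Training log loss of a classifier that separates pos from neg."""
    X = pos + neg
    y = [1] * len(pos) + [0] * len(neg)
    w, b = fit_logistic(X, y)
    return log_loss(w, b, X, y)


rng = random.Random(0)
# Hypothetical stand-ins for the two datasets: B is shifted along feature 0.
data_a = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
data_b = [[rng.gauss(1.5, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]

# Task 1: dataset A (positive) vs. dataset B (negative).
shift_loss = domain_loss(data_a, data_b)

# Task 2 (baseline): two random halves of dataset A.
shuffled = data_a[:]
rng.shuffle(shuffled)
baseline_loss = domain_loss(shuffled[:100], shuffled[100:])

print("A vs. B loss:", shift_loss)
print("A vs. A loss:", baseline_loss)
```

A baseline loss close to log(2) means the classifier cannot tell the two halves
of A apart; a clearly lower loss on the A vs. B task suggests dataset shift.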


For tackling covariate shift there are two broad approaches that are often
used: a) importance weighting and b) representation learning. Both try to
make the two data distributions more similar. The first re-weights training
instances according to the data distribution of the test set. The second tries
to find an embedding for train and test data under which the two
distributions are more similar. Importance weighting is difficult to apply
when the supports of the train and test distributions differ, which is often
the case for high-dimensional data (e.g. NLP).
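
To make the importance-weighting idea concrete, here is a minimal sketch under
strong assumptions: 1-D synthetic data and a simple histogram estimate of the
ratio w(x) = p_test(x) / p_train(x). The names bin_index and
importance_weights and the bin width are hypothetical choices for the
example, and histogram ratios are only one of several possible estimators,
not necessarily the one used in the papers mentioned in this thread.

```python
import random
from collections import Counter


def bin_index(x, width=0.5):
    """Index of the histogram bin that contains x."""
    return int(x // width)


def importance_weights(train, test, width=0.5):
    """Histogram estimate of p_test(x) / p_train(x) for each training point."""
    train_counts = Counter(bin_index(x, width) for x in train)
    test_counts = Counter(bin_index(x, width) for x in test)
    weights = []
    for x in train:
        k = bin_index(x, width)
        p_train = train_counts[k] / len(train)  # never zero: x itself is in bin k
        p_test = test_counts[k] / len(test)     # zero if no test point falls in bin k
        weights.append(p_test / p_train)
    return weights


rng = random.Random(0)
# Hypothetical covariate shift: the test inputs are shifted by one unit.
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
test = [rng.gauss(1.0, 1.0) for _ in range(5000)]

w = importance_weights(train, test)

# Estimate the test-set mean using training points only.
plain_mean = sum(train) / len(train)
weighted_mean = sum(wi * xi for wi, xi in zip(w, train)) / sum(w)
test_mean = sum(test) / len(test)

# Share of test points in bins without any training data. No weight can
# recover these regions, which is the support problem described above.
train_bins = {bin_index(x) for x in train}
uncovered = sum(bin_index(x) not in train_bins for x in test) / len(test)

print("unweighted train mean:", plain_mean)
print("weighted train mean:  ", weighted_mean)
print("test mean:            ", test_mean)
print("uncovered test share: ", uncovered)
```

With scikit-learn, weights of this kind would typically be passed through the
sample_weight argument that many estimators accept in fit. In high dimensions
the uncovered share grows quickly, so a histogram estimate like this one stops
being usable.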

I'm currently travelling - once I'm at home I can send you more references.

best,
Peter
On 15.08.2013 10:41, "Yogesh Karpate" <yogeshkarp...@gmail.com> wrote:

> Hello Folks!
> I have two different brain MR image databases acquired in two different
> countries. I need to perform a patch-based supervised binary classification
> task (+ pathology and - normal). The first database contains both
> +pathology patients and -normal subjects, whereas the second database
> contains only +pathology subjects. I have completed the classification task
> on the first database. Now I want to analyze the second database. Since
> there is no normal/healthy data in the latter case, I need to use that data
> from the first database. There exists a dataset bias. The following are my
> questions:
> 1) Is it the classic case of "covariate shift adaptation"?
> 2) I went through some papers by Sugiyama and Yamada, "No Bias Left
> Behind: Covariate Shift Adaptation". But I am not sure about the
> scalability of those algorithms to my very high-dimensional MRI data. Some
> proposed "Zero Shot Learning".
> 3) How to go about this problem? How do you people tackle dataset bias?
>
>
> All thoughts are welcome!
>
> --
>     Warm Regards
>     Yogesh Karpate
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
