Another set of options to explore is smart sampling, e.g. the SMOTE
family [1], which generates synthetic positive (minority) examples, or
methods that try to keep only the most informative negative samples, see
e.g. Tomek links [2].
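
To make the synthetic-example idea concrete, here is a rough sketch of
SMOTE-style interpolation using only NumPy. It is not the reference
algorithm from [1], just the core step: pick a minority sample, pick one of
its k nearest minority neighbours, and create a new point somewhere on the
segment between them. The function name `smote_like_oversample` and the toy
data are made up for illustration.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper: a simplified SMOTE-style sketch, not a library API.
    Needs at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise squared Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)  # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along each segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: 30 points in 2-D (made-up data).
toy_rng = np.random.default_rng(42)
X_minority = toy_rng.normal(loc=2.0, size=(30, 2))
X_synthetic = smote_like_oversample(X_minority, n_new=100)
print(X_synthetic.shape)
```

The synthetic rows would then be appended to the training set with the
positive label. Because every new point lies between two existing minority
points, this cannot invent examples outside the region the minority class
already covers.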

Also, if you optimize the hyperparameters via grid search, you can provide
your own scoring function. In a recent experiment I did with imbalanced
data, scoring on the accuracy of the minority class only (instead of simple
accuracy) improved recall on the minority class from 0.03 to 0.35.
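
A minimal sketch of what such a custom scorer could look like, assuming a
recent scikit-learn (`sklearn.model_selection`, `class_weight="balanced"`
in place of the older `'auto'`) and reading "accuracy on the minority class
only" as recall on the positive label. The synthetic dataset and parameter
grid are placeholders, not the setup from my experiment.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the real data: roughly 3% positives (label 1).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.97, 0.03], random_state=0
)

# Score each candidate by recall on the minority class only,
# instead of the default overall accuracy.
minority_recall = make_scorer(recall_score, pos_label=1)

# Placeholder grid; the values are illustrative assumptions.
param_grid = {"C": [0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

search = GridSearchCV(SVC(), param_grid, scoring=minority_recall, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Optimizing recall alone can push the model towards predicting everything as
positive, so it is worth checking precision or the confusion matrix on
held-out data as well.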

[1] http://arxiv.org/abs/1106.1813
[2] http://adaptiveilp.googlecode.com/svn/trunk/imbalanced%20datasets%20survey%20paper%20gests.pdf

HTH

Eustache

2014-09-09 17:40 GMT+02:00 Albert Thomas <albertthoma...@gmail.com>:

> Have you tried using One-Class SVM to learn the minority class? I read
> somewhere that it can lead to better results than using both classes when
> you have heavily unbalanced data class proportions.
> Albert
>
> 2014-09-09 16:55 GMT+02:00 ZORAIDA HIDALGO SANCHEZ <
> zoraida.hidalgosanc...@telefonica.com>:
>
>> I already did, but when I test on my (unbalanced) test dataset I get
>> very poor results.
>>
>> From: Eustache DIEMERT <eusta...@diemert.fr>
>> Reply-To: "scikit-learn-general@lists.sourceforge.net" <
>> scikit-learn-general@lists.sourceforge.net>
>> Date: Tuesday, 9 September 2014 16:33
>> To: "scikit-learn-general@lists.sourceforge.net" <
>> scikit-learn-general@lists.sourceforge.net>
>> Subject: Re: [Scikit-learn-general] SVC and unbalanced dataset
>>
>> Besides class weights, you may try to downsample your negative
>> examples.
>>
>>  E/
>>
>> 2014-09-09 14:32 GMT+02:00 ZORAIDA HIDALGO SANCHEZ <
>> zoraida.hidalgosanc...@telefonica.com>:
>>
>>> Dear all,
>>>
>>> I am trying to classify a dataset with a binary target. Positive
>>> instances represent only 3% of the total. I have tried using SVC with
>>> neither class_weight nor sample_weight, and the confusion matrix shows
>>> that all instances are classified as negative. However, if I use either
>>> class_weight='auto' or sample_weight (computing the weight of each
>>> instance as inversely proportional to the percentage of its target
>>> class), then the confusion matrix is the other way around (that is, all
>>> instances are classified as positive).
>>>
>>> What am I doing wrong?
>>>
>>> This is how I have made the calls:
>>>
>>> 1) with no additional parameters:
>>> SVC(probability=True, max_iter=1000, verbose=5)
>>>
>>> 2) with class_weight:
>>> SVC(class_weight='auto', probability=True, max_iter=1000, verbose=5)
>>>
>>> 3) with sample_weight:
>>> classifier = SVC(probability=True, max_iter=1000, verbose=5)
>>>
>>> and later:
>>>
>>> sample_weight = np.asarray(compute_sample_weight(np.unique(y_train), y_train))
>>> classifier.fit(X_train, y_train, sample_weight=sample_weight)
>>>
>>>
>>> # Imports needed by the helper below.
>>> import numpy as np
>>> from sklearn.preprocessing import LabelEncoder
>>>
>>> def compute_sample_weight(classes, y_train):
>>>     # Find the weight of each class as present in y.
>>>     le = LabelEncoder()
>>>     y_ind = le.fit_transform(y_train)
>>>     if not all(np.in1d(classes, le.classes_)):
>>>         raise ValueError("classes should have valid labels that are in y")
>>>
>>>     # inversely proportional to the number of samples in the class
>>>     recip_freq = 1. / np.bincount(y_ind)
>>>     weight = recip_freq[le.transform(classes)] / np.mean(recip_freq)
>>>     weight_by_class = dict(zip(le.classes_, weight))
>>>     y_sample_weight = [weight_by_class[e] for e in y_train]
>>>     return y_sample_weight
>>>
>>>
>>> Thanks.
>>>
>>>
>>>
>>> Z.-
>>>
>>>
>>> ________________________________
>>>
>>> The information contained in this transmission is privileged and
>>> confidential information intended only for the use of the individual or
>>> entity named above. If the reader of this message is not the intended
>>> recipient, you are hereby notified that any dissemination, distribution or
>>> copying of this communication is strictly prohibited. If you have received
>>> this transmission in error, do not read it. Please immediately reply to the
>>> sender that you have received this communication in error and then delete
>>> it.
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>