Thanks Josh for your quick feedback! It's quite helpful indeed.
Further to it, I have another burning question. In my sample dataset,
I have 2 label columns (let's say x and y).
My objective is to give the positive labels within column 'x' 10 times more
weight than the positive labels within column 'y'.
The parameter class_weight={0: 1, 1: 10} works for a single
column, i.e., within a single column I have assigned 10 times more weight to
the positive labels.
But my objective is to give 10 times more weight to the positive labels
within column 'x' as compared to the positive labels within column 'y'.
May I please get your feedback on how to achieve this?
Thanks in advance for your help!
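For what it's worth, the RandomForestClassifier documentation mentions that for multi-output problems class_weight can be a list of dicts, one per column of y. Is a sketch along these lines (toy data, hypothetical weights) the right way to do it?

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                  # toy features
y = rng.randint(0, 2, size=(200, 2))  # two label columns: 'x' and 'y'

# For a multi-output y, class_weight accepts one dict per output column,
# in the same order as the columns of y.
clf = RandomForestClassifier(
    n_estimators=50,
    class_weight=[{0: 1, 1: 10},   # column 'x': positives weighted 10x
                  {0: 1, 1: 1}],   # column 'y': positives unweighted
    random_state=0,
)
clf.fit(X, y)
print(clf.predict(X[:3]).shape)    # one prediction per label column
```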
On Mon, Jan 23, 2017 at 9:56 AM, Josh Vredevoogd <[email protected]>
wrote:
> If you undersample, taking only 10% of the negative class, the classifier
> will see different combinations of attributes and produce a different fit
> to explain those distributions. In the worst case, imagine you are
> classifying birds and through sampling you eliminate all `red` examples.
> Your classifier will now likely not understand that red objects can be
> birds. That's an overly simple example, but given a classifier capable of
> exploring and explaining feature combinations, less obvious versions of
> this are bound to happen.
>
> The extrapolation only works in the other direction: if you manually
> duplicate samples by the sampling factor, you should get the exact same fit
> as if you increased the class weight.
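To make the duplication equivalence concrete, here is a toy sketch (a single decision tree rather than a forest, so both fits are deterministic and fully grow to pure leaves):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] + 0.3 * rng.rand(100) > 0.8).astype(int)

k = 10
pos = y == 1

# Fit 1: weight each positive sample by k via class_weight.
t1 = DecisionTreeClassifier(random_state=0,
                            class_weight={0: 1, 1: k}).fit(X, y)

# Fit 2: physically duplicate each positive sample k times instead.
X_dup = np.vstack([X, np.repeat(X[pos], k - 1, axis=0)])
y_dup = np.concatenate([y, np.repeat(y[pos], k - 1)])
t2 = DecisionTreeClassifier(random_state=0).fit(X_dup, y_dup)

# Both trees separate the training data, so their predictions agree.
assert (t1.predict(X) == t2.predict(X)).all()
```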
>
> Hope that helps,
> Josh
>
>
> On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh <[email protected]>
> wrote:
>
>> Thanks Josh !
>>
>> I have used the parameter class_weight={0: 1, 1: 10} and the model code
>> has run successfully. However, just to get further clarity around its
>> concept, I have another question for you. I did the following 2
>> tests:
>>
>> 1. In my dataset, I have 1 million negative classes and 10,000 positive
>> classes. First I ran my model code without supplying any class_weight
>> parameter and it gave me certain True Positive and False Positive results.
>>
>> 2. In the second test, I had the same 1 million negative classes but
>> reduced the positive classes to 1,000. This time, I supplied the
>> parameter class_weight={0: 1, 1: 10} and got my True Positive and False
>> Positive results.
>>
>> My question is, when I multiply the results obtained from my second test
>> by a factor of 10, they don't match the results obtained from my first
>> test. For example, I get 8 true positives against a threshold from the
>> second test, while the first test gives 260 true positives against the
>> same threshold. I see a similar pattern for the false positives: if I
>> multiply the results obtained in the second test by 10, I don't come
>> close to the results obtained from the first test.
>>
>> Is my expectation correct? Is my way of executing the test (i.e.,
>> reducing the positive classes by a factor of 10 and then applying a class
>> weight of 10 to the positive class) and comparing the results against a
>> model run without any class_weight parameter correct?
>>
>> Please let me know at your convenience, as this will help me greatly in
>> understanding the concept further.
>>
>> Thanks in advance !
>>
>> On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd <[email protected]>
>> wrote:
>>
>>> The class_weight parameter doesn't behave the way you're expecting.
>>>
>>> The value in class_weight is the weight applied to each sample in that
>>> class - in your example, each class zero sample has weight 0.001 and each
>>> class one sample has weight 0.999, so each class one sample carries 999
>>> times the weight of a class zero sample.
>>>
>>> If you would like each class one sample to have ten times the weight,
>>> you would set `class_weight={0: 1, 1: 10}` or `class_weight={0:0.1, 1:1}`
>>> equivalently.
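To check that those two settings really are equivalent (only the ratio of the weights matters), here is a toy sketch with a single decision tree on cleanly separable data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X = rng.rand(100, 3)
y = (X[:, 1] > 0.7).astype(int)  # separable toy labels

# Same 10:1 ratio expressed two ways.
a = DecisionTreeClassifier(random_state=0,
                           class_weight={0: 1, 1: 10}).fit(X, y)
b = DecisionTreeClassifier(random_state=0,
                           class_weight={0: 0.1, 1: 1}).fit(X, y)

# Identical weight ratios yield the same fit, so predictions agree.
assert (a.predict(X) == b.predict(X)).all()
```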
>>>
>>>
>>> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh <[email protected]
>>> > wrote:
>>>
>>>> Hi All,
>>>> Greetings !
>>>>
>>>> I have a very basic question regarding the usage of the
>>>> parameter class_weight in scikit-learn's RandomForestClassifier.
>>>>
>>>> I have a fairly unbalanced sample: my positive class to
>>>> negative class ratio is 1:100. In other words, I have a million records
>>>> corresponding to negative class and 10,000 records corresponding to
>>>> positive class. I have trained the random forest classifier model using the
>>>> above record set successfully.
>>>>
>>>> Further, for a different problem, I want to test the
>>>> parameter class_weight. So, I am setting class_weight={0: 0.001,
>>>> 1: 0.999} and I have tried running my model on the same dataset as mentioned
>>>> in the above paragraph but with the positive class records reduced to 1,000
>>>> [because now each positive class is given approximately 10 times more
>>>> weight than a negative class]. However, the model run results are very
>>>> different between the 2 runs (with and without class_weight), and I
>>>> expected similar results.
>>>>
>>>> Would you please be able to let me know where I am
>>>> going wrong? I know it's something silly, but I just want to improve my
>>>> understanding.
>>>>
>>>> Thanks !
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> [email protected]
>>>> https://mail.python.org/mailman/listinfo/scikit-learn