Re: [Dev] [ML] Categorical or Numerical column?

Srinath Perera Thu, 13 Aug 2015 20:25:42 -0700

Can we use a combination of both?
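
One possible reading of "a combination of both", as a rough sketch only: use the category-count threshold first, and fall back to the skewness check when the column has few distinct values. The names `suggest_type`, `max_categories` and `skew_threshold` are hypothetical and the default of 20 is an arbitrary placeholder; none of this is the actual WSO2 ML implementation.

```python
def skewness(values):
    """Population skewness (third standardized moment); 0.0 for a constant column."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


def suggest_type(values, max_categories=20, skew_threshold=0.01):
    """Suggest 'numerical' or 'categorical' for a non-string column.

    Hypothetical combination rule (an assumption, not from the thread):
      1. more distinct values than max_categories -> numerical
      2. otherwise, |skewness| below skew_threshold -> numerical
      3. otherwise -> categorical
    """
    if len(set(values)) > max_categories:
        return "numerical"
    if abs(skewness(values)) < skew_threshold:
        return "numerical"
    return "categorical"


if __name__ == "__main__":
    print(suggest_type(list(range(100))))             # many distinct values
    print(suggest_type([0, 0, 0, 0, 0, 0, 1, 1, 2]))  # few values, skewed
    print(suggest_type([0, 1, 2] * 10))               # few values, symmetric
```

Note that the last call still returns "numerical" for a symmetric encoded categorical column, which is exactly the case Supun raises below, so the result can only be a default that the user is free to override.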

On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <[email protected]> wrote:


> When a dataset is large, it is generally said to approximate a Normal
> Distribution. :) True, that is hypothetical, but the point they make is
> that when datasets are large, properties of the distribution such as
> skewness and variance become closer to those of a Normal Distribution
> in most cases.
>
> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <[email protected]> wrote:
>
>> Hi Supun,
>>
>> Thanks for the reply.
>>
>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <[email protected]> wrote:
>>
>>> Hi Nirmal,
>>>
>>> IMO, I don't think we would be able to use skewness in this case. Skewness
>>> says how symmetric the distribution is. For example, if we consider a
>>> numerical/continuous feature (not categorical) which is Normally
>>> Distributed, then the skewness would be 0. But for a categorical (encoded)
>>> feature having a symmetric distribution, the skewness would again be
>>> 0.
>>>
>>
>> What's the probability of seeing a normal distribution in a real
>> dataset? IMO it's very low, and since what we're doing here is only a
>> suggestion, do you see it as an issue?
>>
>>
>>>
>>> We did have this concern at the beginning as well, regarding how we
>>> could determine whether a feature is categorical or continuous. Usually
>>> this is strictly dependent on the domain of the dataset (i.e. the user has
>>> to decide this with knowledge of the data). That was the idea behind
>>> letting the user change the data type. But since we needed a default option,
>>> we had to go for the threshold approach, which was the only option we could
>>> come up with. I did a bit of research on this too, but found no
>>> other solution :(
>>>
>>> Thanks,
>>> Supun
>>>
>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <[email protected]>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We have a feature in ML where we suggest whether a given data column of a
>>>> dataset is categorical or numerical. Currently, we determine this by
>>>> using a threshold value (the maximum number of categories that a
>>>> non-string categorical feature can have; if a feature exceeds it, the
>>>> feature is treated as numerical). But this is not a successful
>>>> measure for most datasets.
>>>>
>>>> Can we use the 'skewness' of a distribution as a measure to determine
>>>> this? Can we say a column is numerical if the absolute value of the skewness
>>>> of the distribution is less than a certain threshold (say 0.01)?
>>>>
>>>> *References*:
>>>>
>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>
>>>> --
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>>
>>>> Team Lead - WSO2 Machine Learner
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>> Mobile: +94715779733
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Supun Sethunga*
>>> Software Engineer
>>> WSO2, Inc.
>>> http://wso2.com/
>>> lean | enterprise | middleware
>>> Mobile : +94 716546324
>>>
>>
>
>
>



-- 
============================
Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902

_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
