Another approach for distinguishing between categorical and numerical
features is as follows:

First, we take the unique values from the column and sort them. If it's
a categorical feature, then the gaps between consecutive elements of this
sorted list should be equal. In a numerical feature, this is extremely
unlikely to happen. This holds in most scenarios, but there are a few
exceptions as well, e.g. when a numerical ID is used as the categorical
label: 19933, 19913, 18832, ...

This is a very simple hack that can be easily implemented, but it is not a
standard technique.
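
A minimal sketch of the check described above, in Python. The function
names (has_equal_gaps, suggest_type) and the tolerance parameter are
hypothetical and are not part of the ML codebase. The sketch assumes the
column holds numeric values with no missing entries, and it treats columns
with fewer than three unique values as categorical, which is an assumption
of this sketch only.

```python
def has_equal_gaps(values, tol=1e-9):
    """Return True if the sorted unique values are evenly spaced."""
    unique = sorted(set(values))
    if len(unique) < 3:
        # Fewer than two gaps to compare; assumed "evenly spaced" here.
        return True
    # Differences between consecutive sorted unique values.
    gaps = [b - a for a, b in zip(unique, unique[1:])]
    # All gaps must match the first gap within the tolerance.
    return all(abs(g - gaps[0]) <= tol for g in gaps)


def suggest_type(values):
    """Suggest 'categorical' when the gaps are equal, else 'numerical'."""
    return "categorical" if has_equal_gaps(values) else "numerical"


if __name__ == "__main__":
    print(suggest_type([0, 1, 2, 1, 0, 3]))       # encoded labels
    print(suggest_type([1.2, 3.7, 3.9, 10.4]))    # continuous values
    print(suggest_type([19933, 19913, 18832]))    # numerical IDs as labels
```

The last call illustrates the exception mentioned above: the ID-style
labels have unequal gaps, so this check would suggest "numerical" for them.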

WDYT?

On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <srin...@wso2.com> wrote:

> I mean current approach and skewness?
>
> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <srin...@wso2.com> wrote:
>
>> Can we use a combination of both?
>>
>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <sup...@wso2.com> wrote:
>>
>>> When a dataset is large, in general it's said to approximate a
>>> Normal Distribution. :)  True, it's hypothetical, but the point they make
>>> is that when datasets are large, properties of a distribution like
>>> skewness, variance, etc. become closer to the properties of a Normal
>>> Distribution in most cases.
>>>
>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <nir...@wso2.com>
>>> wrote:
>>>
>>>> Hi Supun,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <sup...@wso2.com>
>>>> wrote:
>>>>
>>>>> Hi Nirmal,
>>>>>
>>>>> IMO, I don't think we would be able to use skewness in this case.
>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>> consider a numerical/continuous feature (not categorical) which is
>>>>> Normally Distributed, then the skewness would be 0. Also, for a
>>>>> categorical (encoded) feature having a symmetric distribution, the
>>>>> skewness would again be 0.
>>>>>
>>>>
>>>> What's the probability of seeing a normal distribution in a real
>>>> dataset? IMO it's very low, and since what we're doing here is only a
>>>> suggestion, do you see it as an issue?
>>>>
>>>>
>>>>>
>>>>> We did have this concern at the beginning as well, regarding how we
>>>>> could determine whether a feature is categorical or continuous. Usually
>>>>> this is strictly dependent on the domain of the dataset (i.e. the user
>>>>> has to decide this with knowledge about the data). That was the idea
>>>>> behind letting the user change the data type. But since we needed a
>>>>> default option, we had to go for the threshold thing, which was the only
>>>>> option we could come up with. I did a bit of research on this too, but
>>>>> found no other solution :(
>>>>>
>>>>> Thanks,
>>>>> Supun
>>>>>
>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <nir...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We have a feature in ML where we suggest whether a given data column of
>>>>>> a dataset is categorical or numerical. Currently, we determine this by
>>>>>> using a threshold value (the maximum number of categories that a
>>>>>> non-string categorical feature can have; if it is exceeded, the feature
>>>>>> will be treated as a numerical feature). But this is not a
>>>>>> successful measure for most datasets.
>>>>>>
>>>>>> Can we use the 'skewness' of a distribution as a measure to determine
>>>>>> this? Can we say a column is numerical if the modulus of the skewness
>>>>>> of the distribution is less than a certain threshold (say 0.01)?
>>>>>>
>>>>>> *References*:
>>>>>>
>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Thanks & regards,
>>>>>> Nirmal
>>>>>>
>>>>>> Team Lead - WSO2 Machine Learner
>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>> Mobile: +94715779733
>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Supun Sethunga*
>>>>> Software Engineer
>>>>> WSO2, Inc.
>>>>> http://wso2.com/
>>>>> lean | enterprise | middleware
>>>>> Mobile : +94 716546324
>>>>>
>>>
>>
>>
>>
>> --
>> ============================
>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>> Site: http://people.apache.org/~hemapani/
>> Photos: http://www.flickr.com/photos/hemapani/
>> Phone: 0772360902
>>
>



-- 
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 : http://wso2.com/
Email: mahesha...@wso2.com
Mobile: +94711228855
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
