Moreover, I think a hybrid approach as follows might work well.

1. Select a sample

2. Filter columns by the data type and find potential categorical variables
(integer / string)

3. Filter further by checking whether the same values are repeated multiple
times in the dataset.
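
The three steps above could be sketched roughly as follows in plain Python. The function name `suggest_categorical`, the sample size, and the `min_avg_repeats` cut-off are assumptions made up for illustration; none of them are values agreed on in this thread.

```python
import random
from collections import Counter


def suggest_categorical(column, sample_size=1000, min_avg_repeats=5.0, seed=0):
    """Suggest whether a column looks categorical (hypothetical helper).

    sample_size and min_avg_repeats are illustrative assumptions.
    """
    # Step 1: select a sample (the whole column if it is small enough).
    values = list(column)
    if len(values) > sample_size:
        values = random.Random(seed).sample(values, sample_size)
    if not values:
        return False

    # Step 2: only integer or string columns are candidates.
    if not all(isinstance(v, (int, str)) for v in values):
        return False

    # Step 3: check that values repeat, i.e. that each distinct value
    # occurs several times on average within the sample.
    counts = Counter(values)
    avg_repeats = len(values) / len(counts)
    return avg_repeats >= min_avg_repeats
```

For example, `suggest_categorical(["red", "green", "blue"] * 50)` suggests a categorical column, while `suggest_categorical(list(range(150)))` does not, because no value is repeated.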

On Fri, Aug 14, 2015 at 2:53 PM, Nirmal Fernando <nir...@wso2.com> wrote:

> Thanks for all the input.
>
> So let me summarise:
>
> *the problem*
>
> * We need to determine whether a feature is categorical or not, in order to
> draw certain graphs for exploring a dataset before a user starts to build
> analyses (i.e. before any user input).
> * We can't get 100% accuracy, so whatever we produce is only a suggestion.
> * The question is which method would be the most accurate.
>
> *solutions*
>
> 1. Categorical threshold: if the number of distinct values is less than X, it
> is a categorical feature.
> 2. Make all features with only integers (no decimals) categorical.
> 3. Skewness: if the absolute skewness of a feature's distribution is less
> than X, it is a numerical feature.
> 4. Gaps between consecutive distinct values
> 5. Combined solution
>
> On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <
> mahesha...@wso2.com> wrote:
>
>> Another approach to distinguishing between categorical and numerical
>> features is as follows:
>>
>> First, we take the unique values from the column and sort them. If
>> it's a categorical feature, then the gaps between the elements of this
>> sorted list should be equal. In a numerical feature, this is extremely
>> unlikely to happen. This behavior is valid in most scenarios, but there are
>> a few exceptions as well, e.g. when a numerical ID is used as the
>> categorical label: 19933, 19913, 18832, ...
>>
>> This is a very simple hack that can be easily implemented, but it is not a
>> standard technique.
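
A minimal sketch of the gap check described above, assuming the column holds plain numbers. The function name `has_equal_gaps` and the `tolerance` parameter are hypothetical and only serve the illustration.

```python
def has_equal_gaps(column, tolerance=1e-9):
    """Return True if the sorted distinct values are evenly spaced."""
    distinct = sorted(set(column))
    if len(distinct) < 3:
        # At most one gap exists, so nothing contradicts equal spacing.
        return True
    # Differences between consecutive distinct values.
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    # tolerance is an assumed allowance for floating-point noise.
    return all(abs(g - gaps[0]) <= tolerance for g in gaps)
```

With this sketch, `has_equal_gaps([3, 1, 2, 2, 1, 3])` is true (encoded labels 1, 2, 3), while `has_equal_gaps([19933, 19913, 18832])` is false, which is exactly the numerical-ID exception mentioned above.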
>>
>> WDYT?
>>
>> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <srin...@wso2.com> wrote:
>>
>>> I mean the current approach and skewness?
>>>
>>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <srin...@wso2.com>
>>> wrote:
>>>
>>>> Can we use a combination of both?
>>>>
>>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <sup...@wso2.com>
>>>> wrote:
>>>>
>>>>> When a dataset is large, it is generally said to approximate a
>>>>> Normal Distribution. :)  True, that is hypothetical, but the point they
>>>>> make is that when datasets are large, properties of a distribution such as
>>>>> skewness and variance become closer to the properties of a Normal
>>>>> Distribution in most cases.
>>>>>
>>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <nir...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Supun,
>>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <sup...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Nirmal,
>>>>>>>
>>>>>>> IMO, I don't think we would be able to use skewness in this case.
>>>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>>>> consider a numerical/continuous feature (not categorical) which is
>>>>>>> normally distributed, then the skewness would be 0. For a categorical
>>>>>>> (encoded) feature having a symmetric distribution, the skewness would
>>>>>>> again be 0.
>>>>>>>
>>>>>>
>>>>>> What's the probability of seeing a normal distribution in a real
>>>>>> dataset? IMO it's very low, and since what we're doing here is only a
>>>>>> suggestion, do you see it as an issue?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> We did have this concern at the beginning as well, regarding how we
>>>>>>> could determine whether a feature is categorical or continuous. Usually
>>>>>>> this depends strictly on the domain of the dataset (i.e. the user has to
>>>>>>> decide this with knowledge of the data). That was the idea behind
>>>>>>> letting the user change the data type. But since we needed a default
>>>>>>> option, we had to go for the threshold approach, which was the only
>>>>>>> option we could come up with. I did a bit of research on this too, but
>>>>>>> found no other solution :(
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Supun
>>>>>>>
>>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <nir...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We have a feature in ML where we suggest whether a given data column of
>>>>>>>> a dataset is categorical or numerical. Currently, we determine this by
>>>>>>>> using a threshold value (the maximum number of categories that a
>>>>>>>> non-string categorical feature can have; if this is exceeded, the
>>>>>>>> feature will be treated as a numerical feature). But this is not a
>>>>>>>> successful measure for most datasets.
>>>>>>>>
>>>>>>>> Can we use the 'skewness' of a distribution as a measure to
>>>>>>>> determine this? Can we say that a column is numerical if the modulus of
>>>>>>>> the skewness of the distribution is less than a certain threshold (say
>>>>>>>> 0.01)?
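
The skewness rule proposed above might look like the following sketch. It assumes the population (moment-based) skewness m3 / m2^1.5, which is one common definition and not necessarily the one intended here. The function names are hypothetical, and 0.01 is the example threshold from the question.

```python
def skewness(values):
    """Moment-based skewness m3 / m2**1.5 of a non-empty sequence."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: treat as having no skew
    return m3 / m2 ** 1.5


def looks_numerical_by_skewness(values, threshold=0.01):
    """Proposed rule: numerical if |skewness| is below the threshold."""
    return abs(skewness(values)) < threshold
```

Note that a symmetric encoded categorical column such as `[0, 1, 2] * 10` also has zero skewness, so this check on its own would label it numerical.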
>>>>>>>>
>>>>>>>> *References*:
>>>>>>>>
>>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Thanks & regards,
>>>>>>>> Nirmal
>>>>>>>>
>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>> Mobile: +94715779733
>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Supun Sethunga*
>>>>>>> Software Engineer
>>>>>>> WSO2, Inc.
>>>>>>> http://wso2.com/
>>>>>>> lean | enterprise | middleware
>>>>>>> Mobile : +94 716546324
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ============================
>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>> Site: http://people.apache.org/~hemapani/
>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>> Phone: 0772360902
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Pruthuvi Maheshakya Wijewardena
>> Software Engineer
>> WSO2 : http://wso2.com/
>> Email: mahesha...@wso2.com
>> Mobile: +94711228855
>>
>>
>>
>
>
> _______________________________________________
> Dev mailing list
> Dev@wso2.org
> http://wso2.org/cgi-bin/mailman/listinfo/dev
>
>


-- 
Regards,

Thushan Ganegedara
School of IT
University of Sydney, Australia
