>
> Combined solution:
> * If a feature contains strings, it is categorical.
> * Frequency of distinct values: if the values repeat enough (80% by default)
> and the feature has no decimal values, it is categorical.


+1
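
A minimal Python sketch of the combined solution quoted in this message, just to make the two rules concrete. The function name `looks_categorical`, the `repeat_ratio` parameter, and the reading of "repeat enough" as "the share of rows whose value occurs more than once" are all assumptions for illustration, not something agreed in this thread.

```python
from collections import Counter


def looks_categorical(values, repeat_ratio=0.8):
    """Suggest whether a column is categorical (hypothetical helper)."""
    values = [v for v in values if v is not None]
    if not values:
        return False

    # Rule 1: a feature that contains strings is categorical.
    if any(isinstance(v, str) for v in values):
        return True

    # Rule 2a: any decimal (non-integral) value rules out categorical.
    if any(not float(v).is_integer() for v in values):
        return False

    # Rule 2b (assumed interpretation): the fraction of rows whose value
    # occurs more than once must reach the threshold (80% by default).
    counts = Counter(values)
    repeated_rows = sum(c for c in counts.values() if c > 1)
    return repeated_rows / len(values) >= repeat_ratio


if __name__ == "__main__":
    print(looks_categorical(["red", "green", "blue"]))
    print(looks_categorical([1, 2, 1, 2, 1, 2, 1, 2, 3, 1]))
    print(looks_categorical(list(range(100))))
```

Under this reading, a column of unique integer IDs is not flagged as categorical, because none of its values repeat.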

On Fri, Aug 14, 2015 at 12:13 PM, Nirmal Fernando <nir...@wso2.com> wrote:

> Combined solution:
>
> * If a feature contains strings, it is categorical.
> * Frequency of distinct values: if the values repeat enough (80% by default)
> and the feature has no decimal values, it is categorical.
>
> On Fri, Aug 14, 2015 at 7:22 PM, Supun Sethunga <sup...@wso2.com> wrote:
>
>> Hi all,
>>
>> +1 for a hybrid solution, but still a -1 for using skewness, even in the
>> hybrid solution :D
>>
>> A good example of why we shouldn't use skewness is the income
>> distribution graph in [1]. There, regardless of whether I'm using the raw
>> data (then it's a continuous feature) or breaking it into intervals and
>> categorizing the income into several levels, I would get the same shape
>> for the distribution, i.e. the skewness would be significant.
>>
>> So the point I'm trying to make is that categorical features as well as
>> continuous features can be skewed or symmetric, so skewness can't really
>> distinguish between them.
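
The argument above can be checked with a small Python sketch. It builds a deterministic right-skewed "income" column (a log-normal shape chosen here purely for illustration; it is not the data behind [1]), bins it into a few levels, and computes the skewness of both versions. The parameters `10.5`, `0.8`, the 25,000 bin width and the helper name `skewness` are all assumptions.

```python
import math
from statistics import NormalDist, fmean, pstdev


def skewness(xs):
    """Population skewness: the third standardized moment."""
    mean = fmean(xs)
    sd = pstdev(xs)
    return fmean([((x - mean) / sd) ** 3 for x in xs])


# Deterministic log-normal-shaped sample built from evenly spaced quantiles.
n = 1000
std_normal = NormalDist()
incomes = [
    math.exp(10.5 + 0.8 * std_normal.inv_cdf((i + 0.5) / n)) for i in range(n)
]

# The same data encoded as income levels 0..8 (25,000-wide bins, top capped).
levels = [min(int(x // 25_000), 8) for x in incomes]

raw_skew = skewness(incomes)    # continuous version of the feature
binned_skew = skewness(levels)  # categorical (encoded) version

print(f"raw skewness:    {raw_skew:.2f}")
print(f"binned skewness: {binned_skew:.2f}")
```

Both values come out clearly positive, so a skewness threshold alone would treat the continuous column and its categorical encoding the same way.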
>>
>> [1]
>> https://cdn2.vox-cdn.com/uploads/chorus_asset/file/2930990/Distribution_of_Annual_Household_Income_in_the_United_States_2012.0.png
>>
>>
>> On Fri, Aug 14, 2015 at 1:03 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>>
>>> Thanks Thushan. Good suggestion on the frequency.
>>>
>>> *solutions*
>>>
>>> 1. Categorical threshold: if the # of distinct values is less than X, it is
>>> a categorical feature.
>>> 2. Make all features with only integers (no decimals) categorical.
>>> 3. Skewness: if the skewness of the distribution of a feature is less than X,
>>> it is a categorical feature.
>>> 4. Gaps between consecutive distinct values: if the gaps are equal, it is a
>>> categorical feature.
>>> 5. Frequency of distinct values: if they repeat enough, it is a
>>> categorical feature.
>>> 6. Combined solution
>>>
>>> So, I guess as suggested by many of you, we need to build a combined
>>> solution.
>>>
>>> On Fri, Aug 14, 2015 at 10:29 AM, Thushan Ganegedara <thu...@gmail.com>
>>> wrote:
>>>
>>>> Moreover, I think a hybrid approach as follows might work well.
>>>>
>>>> 1. Select a sample
>>>>
>>>> 2. Filter columns by the data type and find potential categorical
>>>> variables (integer / string)
>>>>
>>>> 3. Filter further by checking if same values are repeated multiple
>>>> times in the dataset.
>>>>
>>>> On Fri, Aug 14, 2015 at 2:53 PM, Nirmal Fernando <nir...@wso2.com>
>>>> wrote:
>>>>
>>>>> Thanks for all the input.
>>>>>
>>>>> So let me summarise;
>>>>>
>>>>> *the problem*
>>>>>
>>>>> * We need to determine whether a feature is categorical or not, in
>>>>> order to draw certain graphs for exploring a dataset before a user starts
>>>>> to build analyses (i.e. before any user input).
>>>>> * We can't be 100% accurate, so whatever we produce is only a
>>>>> suggestion.
>>>>> * The question is: what would be the most accurate method?
>>>>>
>>>>> *solutions*
>>>>>
>>>>> 1. Categorical threshold: if the # of distinct values is less than X, it
>>>>> is a categorical feature.
>>>>> 2. Make all features with only integers (no decimals) categorical.
>>>>> 3. Skewness: if the skewness of the distribution of a feature is less than
>>>>> X, it is a categorical feature.
>>>>> 4. Gaps between consecutive distinct values: if the gaps are equal, it is
>>>>> a categorical feature.
>>>>> 5. Combined solution
>>>>>
>>>>> On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena
>>>>> <mahesha...@wso2.com> wrote:
>>>>>
>>>>>> Another approach to distinguishing between categorical and numerical
>>>>>> features is as follows:
>>>>>>
>>>>>> First, we take the unique values from the column and sort them.
>>>>>> If it's a categorical feature, the gaps between consecutive elements of
>>>>>> this sorted list should be equal. In a numerical feature, this is
>>>>>> extremely unlikely to happen. This holds in most scenarios, but there
>>>>>> are a few exceptions as well, e.g. when a numerical ID is used as the
>>>>>> categorical label: 19933, 19913, 18832, ...
>>>>>>
>>>>>> This is a very simple hack that can be easily implemented, but not a
>>>>>> standard technique.
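
The gap check described above could be sketched in Python as follows. The function name `has_equal_gaps` and the tolerance parameter `tol` are hypothetical, and treating columns with fewer than three distinct values as "equal gaps" is an assumption made here, since there is at most one gap to compare.

```python
def has_equal_gaps(values, tol=1e-9):
    """Return True if the sorted distinct values are evenly spaced."""
    distinct = sorted(set(values))
    if len(distinct) < 3:
        # Zero or one gap: nothing to compare, so treat as evenly spaced.
        return True
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    return all(abs(g - gaps[0]) <= tol for g in gaps)


if __name__ == "__main__":
    print(has_equal_gaps([1, 2, 3, 2, 1, 3]))       # encoded categories
    print(has_equal_gaps([0.13, 2.7, 3.1, 9.9]))    # continuous readings
    print(has_equal_gaps([19933, 19913, 18832]))    # numerical IDs as labels
```

The last call is the exception mentioned above: numerical IDs used as categorical labels have unequal gaps, so this check alone would not flag them as categorical.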
>>>>>>
>>>>>> WDYT?
>>>>>>
>>>>>> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <srin...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I mean current approach and skewness?
>>>>>>>
>>>>>>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <srin...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can we use a combination of both?
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <sup...@wso2.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> When a dataset is large, it is generally said to approximate a
>>>>>>>>> Normal Distribution. :) True, that is hypothetical, but the point is
>>>>>>>>> that when datasets are large, properties of a distribution such as
>>>>>>>>> skewness and variance become closer to those of a Normal Distribution
>>>>>>>>> in most cases.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <nir...@wso2.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Supun,
>>>>>>>>>>
>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <sup...@wso2.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Nirmal,
>>>>>>>>>>>
>>>>>>>>>>> IMO, I don't think we would be able to use skewness in this case.
>>>>>>>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>>>>>>>> consider a numerical/continuous feature (not categorical) which is
>>>>>>>>>>> Normally Distributed, then the skewness would be 0. Likewise, for a
>>>>>>>>>>> categorical (encoded) feature with a symmetric distribution, the
>>>>>>>>>>> skewness would again be 0.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What's the probability of seeing a normal distribution in a real
>>>>>>>>>> dataset? IMO it's very low, and since what we're doing here is only a
>>>>>>>>>> suggestion, do you see it as an issue?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We did have this concern at the beginning as well, regarding how
>>>>>>>>>>> we could determine whether a feature is categorical or continuous.
>>>>>>>>>>> Usually this depends strictly on the domain of the dataset (i.e. the
>>>>>>>>>>> user has to decide this with knowledge about the data). That was the
>>>>>>>>>>> idea behind letting the user change the data type. But since we needed
>>>>>>>>>>> a default option, we had to go for the threshold approach, which was
>>>>>>>>>>> the only option we could come up with. I did a bit of research on this
>>>>>>>>>>> too, but found no other solution :(
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Supun
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando
>>>>>>>>>>> <nir...@wso2.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> We have a feature in ML where we suggest whether a given data column
>>>>>>>>>>>> of a dataset is categorical or numerical. Currently, we determine
>>>>>>>>>>>> this using a threshold value (the maximum number of categories
>>>>>>>>>>>> that a non-string categorical feature can have; if it is exceeded,
>>>>>>>>>>>> the feature is treated as a numerical feature). But this
>>>>>>>>>>>> is not a successful measure for most datasets.
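
For reference, the threshold rule described above might look roughly like this in Python. This is only a sketch and not the actual WSO2 ML code: the function name `suggest_type`, the parameter `max_categories` and its default of 20 are assumptions.

```python
def suggest_type(values, max_categories=20):
    """Suggest 'categorical' or 'numerical' using only a distinct-value cap."""
    values = list(values)
    # String features are treated as categorical regardless of the threshold.
    if any(isinstance(v, str) for v in values):
        return "categorical"
    # Non-string features: categorical only if the distinct count fits the cap.
    if len(set(values)) <= max_categories:
        return "categorical"
    return "numerical"


if __name__ == "__main__":
    print(suggest_type([1, 2, 3] * 10))      # a few repeated codes
    print(suggest_type(range(100)))          # many distinct values
    print(suggest_type([1.5, 2.7, 3.1]))     # small continuous sample
```

The last call shows one way the rule can mislead: a small sample of continuous readings has few distinct values, so it is suggested as categorical.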
>>>>>>>>>>>>
>>>>>>>>>>>> Can we use the 'skewness' of a distribution as a measure to
>>>>>>>>>>>> determine this? Can we say that a column is numerical if the modulus
>>>>>>>>>>>> of the skewness of the distribution is less than a certain threshold
>>>>>>>>>>>> (say 0.01)?
>>>>>>>>>>>>
>>>>>>>>>>>> *References*:
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks & regards,
>>>>>>>>>>>> Nirmal
>>>>>>>>>>>>
>>>>>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>>>>>> Mobile: +94715779733
>>>>>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Supun Sethunga*
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> WSO2, Inc.
>>>>>>>>>>> http://wso2.com/
>>>>>>>>>>> lean | enterprise | middleware
>>>>>>>>>>> Mobile : +94 716546324
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> ============================
>>>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>>>> Site: http://people.apache.org/~hemapani/
>>>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>>>> Phone: 0772360902
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Pruthuvi Maheshakya Wijewardena
>>>>>> Software Engineer
>>>>>> WSO2 : http://wso2.com/
>>>>>> Email: mahesha...@wso2.com
>>>>>> Mobile: +94711228855
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list
>>>>> Dev@wso2.org
>>>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Thushan Ganegedara
>>>> School of IT
>>>> University of Sydney, Australia
>>>>
>>>
>>
>>
>


-- 
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324