+1

On Fri, Aug 14, 2015 at 12:13 PM, Nirmal Fernando <nir...@wso2.com> wrote:

> Combined solution:
>
> * If a feature contains strings -> categorical.
> * Frequency of distinct values: if they repeat enough (80% default) and
>   the feature doesn't have decimal values, then it is a categorical feature.
>
> On Fri, Aug 14, 2015 at 7:22 PM, Supun Sethunga <sup...@wso2.com> wrote:
>
>> Hi all,
>>
>> +1 for a hybrid solution. But still a -1 for using skewness, even in the
>> hybrid solution :D
>>
>> One good example of why we shouldn't use skewness is the income
>> distribution graph in [1]. There, regardless of whether I'm using the raw
>> data (then it's a continuous feature) or breaking it into intervals and
>> categorizing the income into several levels, I would get the same shape
>> for the distribution, i.e. the skewness would be significant.
>>
>> So the point I'm trying to make is that categorical features as well as
>> continuous features can be skewed or symmetric, and we can't really
>> distinguish between them using skewness.
>>
>> [1]
>> https://cdn2.vox-cdn.com/uploads/chorus_asset/file/2930990/Distribution_of_Annual_Household_Income_in_the_United_States_2012.0.png
>>
>> On Fri, Aug 14, 2015 at 1:03 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>>
>>> Thanks Thushan. Good suggestion on the frequency.
>>>
>>> *solutions*
>>>
>>> 1. Categorical threshold: if the number of distinct values is less than
>>>    X, it is a categorical feature.
>>> 2. Make all features with only integers (no decimals) categorical.
>>> 3. Skewness: if the skewness of the distribution of a feature is less
>>>    than X, it is a categorical feature.
>>> 4. Gaps between consecutive distinct values.
>>> 5. Frequency of distinct values: if they repeat enough, then it is a
>>>    categorical feature.
>>> 6. Combined solution.
>>>
>>> So, I guess, as suggested by many of you, we need to build a combined
>>> solution.
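For illustration only, the combined rule quoted at the top of this thread could be sketched roughly as below. The function name `suggest_categorical`, the `repeat_ratio` parameter, and the reading of "repeat enough (80% default)" as "at least 80% of the entries in the sample have a value that occurs more than once" are all assumptions; the thread does not pin any of these down.

```python
from collections import Counter


def suggest_categorical(values, repeat_ratio=0.8):
    """Suggest whether a column looks categorical (hypothetical helper).

    Assumed reading of the rule: strings -> categorical; otherwise the
    column must contain no decimal values, and at least `repeat_ratio`
    of its entries must carry a value that occurs more than once.
    """
    values = [v for v in values if v is not None]  # ignore missing entries
    if not values:
        return False
    # Rule 1: any string value makes the feature categorical.
    if any(isinstance(v, str) for v in values):
        return True
    # Rule 2a: any value with a fractional part rules out categorical.
    if any(not float(v).is_integer() for v in values):
        return False
    # Rule 2b: fraction of entries whose value is repeated in the sample.
    counts = Counter(values)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(values) >= repeat_ratio
```

Under these assumptions, `suggest_categorical([3, 1, 3, 1, 2, 2])` returns `True`, while `suggest_categorical(list(range(100)))` returns `False` because no value repeats. The result is only a suggestion that the user can override.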
>>>
>>> On Fri, Aug 14, 2015 at 10:29 AM, Thushan Ganegedara <thu...@gmail.com>
>>> wrote:
>>>
>>>> Moreover, I think a hybrid approach as follows might work well.
>>>>
>>>> 1. Select a sample.
>>>>
>>>> 2. Filter columns by data type and find potential categorical
>>>>    variables (integer / string).
>>>>
>>>> 3. Filter further by checking whether the same values are repeated
>>>>    multiple times in the dataset.
>>>>
>>>> On Fri, Aug 14, 2015 at 2:53 PM, Nirmal Fernando <nir...@wso2.com>
>>>> wrote:
>>>>
>>>>> Thanks for all the input.
>>>>>
>>>>> So let me summarise:
>>>>>
>>>>> *the problem*
>>>>>
>>>>> * We need to determine whether a feature is a categorical one or not,
>>>>>   to draw certain graphs to explore a dataset, before a user starts to
>>>>>   build analyses (before user input).
>>>>> * We can't get 100% accuracy, hence what we do is of course only a
>>>>>   suggestion.
>>>>> * The question is, what would be the most accurate method?
>>>>>
>>>>> *solutions*
>>>>>
>>>>> 1. Categorical threshold: if the number of distinct values is less
>>>>>    than X, it is a categorical feature.
>>>>> 2. Make all features with only integers (no decimals) categorical.
>>>>> 3. Skewness: if the skewness of the distribution of a feature is less
>>>>>    than X, it is a categorical feature.
>>>>> 4. Gaps between consecutive distinct values.
>>>>> 5. Combined solution.
>>>>>
>>>>> On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <
>>>>> mahesha...@wso2.com> wrote:
>>>>>
>>>>>> Another approach to distinguish between categorical and numerical
>>>>>> features can be elaborated as follows:
>>>>>>
>>>>>> First, we take out the unique values from the column and sort them.
>>>>>> If it's a categorical feature, then the gaps between the elements of
>>>>>> this sorted list should be equal. In a numerical feature, this is
>>>>>> extremely unlikely to happen. This behavior is valid in most
>>>>>> scenarios, but there are a few exceptions as well.
>>>>>> E.g., when a numerical ID is used as the categorical label:
>>>>>> 19933, 19913, 18832, ...
>>>>>>
>>>>>> This is a very simple hack that can be easily implemented, but not a
>>>>>> standard technique.
>>>>>>
>>>>>> WDYT?
>>>>>>
>>>>>> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <srin...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I mean current approach and skewness?
>>>>>>>
>>>>>>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <srin...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can we use a combination of both?
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <sup...@wso2.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> When a dataset is large, in general it's said to approximate a
>>>>>>>>> Normal Distribution. :) True, it's hypothetical, but the point
>>>>>>>>> they make is that when datasets are large, properties of a
>>>>>>>>> distribution like skewness, variance, etc. become closer to the
>>>>>>>>> properties of a Normal Distribution in most cases.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando
>>>>>>>>> <nir...@wso2.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Supun,
>>>>>>>>>>
>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <sup...@wso2.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Nirmal,
>>>>>>>>>>>
>>>>>>>>>>> IMO I don't think we would be able to use skewness in this case.
>>>>>>>>>>> Skewness says how symmetric the distribution is. For example, if
>>>>>>>>>>> we consider a numerical/continuous feature (not categorical)
>>>>>>>>>>> which is Normally Distributed, then the skewness would be 0.
>>>>>>>>>>> Also, for a categorical (encoded) feature having a symmetric
>>>>>>>>>>> distribution, the skewness would again be 0.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What's the probability of seeing a normal distribution in a real
>>>>>>>>>> dataset?
>>>>>>>>>> IMO it's very low, and since what we're doing here is a
>>>>>>>>>> suggestion, do you see it as an issue?
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We did have this concern at the beginning as well, regarding how
>>>>>>>>>>> we could determine whether a feature is categorical or
>>>>>>>>>>> continuous. Usually this is strictly dependent on the domain of
>>>>>>>>>>> the dataset (i.e. the user has to decide this with knowledge
>>>>>>>>>>> about the data). That was the idea behind letting the user
>>>>>>>>>>> change the data type. But since we needed a default option, we
>>>>>>>>>>> had to go for the threshold approach, which was the only option
>>>>>>>>>>> we could come up with. I did a bit of research on this too, but
>>>>>>>>>>> found no other solution :(
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Supun
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <
>>>>>>>>>>> nir...@wso2.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> We have a feature in ML where we suggest whether a given data
>>>>>>>>>>>> column of a dataset is categorical or numerical. Currently, we
>>>>>>>>>>>> determine this by using a threshold value (the maximum number
>>>>>>>>>>>> of categories that a non-string categorical feature can have;
>>>>>>>>>>>> if it is exceeded, the feature is treated as a numerical
>>>>>>>>>>>> feature). But this is not a successful measurement for most
>>>>>>>>>>>> datasets.
>>>>>>>>>>>>
>>>>>>>>>>>> Can we use the 'skewness' of a distribution as a measurement to
>>>>>>>>>>>> determine this? Can we say a column is numerical if the modulus
>>>>>>>>>>>> of the skewness of the distribution is less than a certain
>>>>>>>>>>>> threshold (say 0.01)?
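To make the skewness objection raised in this thread concrete, here is a small sketch using only the Python standard library. All four data lists are made-up illustrative values, and `skewness` is a hypothetical helper that computes the population moment skewness m3 / m2^1.5; none of this comes from the ML codebase.

```python
def skewness(values):
    """Population moment skewness: m3 / m2**1.5 (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up data: a symmetric encoded categorical column and a symmetric
# continuous column. Both have (near-)zero skewness.
categorical = [1, 2, 2, 3, 3, 3, 4, 4, 5]
continuous = [-2.7, -1.1, -0.4, 0.0, 0.4, 1.1, 2.7]

# Made-up right-skewed "income" values, and the same values binned into
# levels of width 50. Both are clearly positively skewed.
income = [12.5, 18.0, 21.3, 25.9, 30.2, 41.7, 58.4, 96.1, 210.0]
income_level = [int(v // 50) for v in income]

for name, column in [("categorical", categorical), ("continuous", continuous),
                     ("income", income), ("income_level", income_level)]:
    print(f"{name}: skewness = {skewness(column):.3f}")
```

In this toy data a categorical and a continuous column share a skewness close to 0, and a continuous column and its binned version are both strongly right-skewed, so a threshold on the absolute skewness would not separate the two kinds of feature here.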
>>>>>>>>>>>>
>>>>>>>>>>>> *References*:
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks & regards,
>>>>>>>>>>>> Nirmal
>>>>>>>>>>>>
>>>>>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>>>>>> Mobile: +94715779733
>>>>>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Supun Sethunga*
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> WSO2, Inc.
>>>>>>>>>>> http://wso2.com/
>>>>>>>>>>> lean | enterprise | middleware
>>>>>>>>>>> Mobile : +94 716546324
>>>>>>>>
>>>>>>>> --
>>>>>>>> ============================
>>>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>>>> Site: http://people.apache.org/~hemapani/
>>>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>>>> Phone: 0772360902
>>>>>>
>>>>>> --
>>>>>> Pruthuvi Maheshakya Wijewardena
>>>>>> Software Engineer
>>>>>> WSO2 : http://wso2.com/
>>>>>> Email: mahesha...@wso2.com
>>>>>> Mobile: +94711228855
>>>>>
>>>>> _______________________________________________
>>>>> Dev mailing list
>>>>> Dev@wso2.org
>>>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Thushan Ganegedara
>>>> School of IT
>>>> University of Sydney, Australia

--
*Supun Sethunga*
Software Engineer
WSO2, Inc.
http://wso2.com/
lean | enterprise | middleware
Mobile : +94 716546324
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev