Thanks Thushan. Good suggestion on the frequency.

*solutions*

1. Categorical threshold: if the number of distinct values is less than X, it
   is a categorical feature.
2. Make all features with only integers (no decimals) categorical.
3. Skewness: if the skewness of a feature's distribution is less than X, it is
   a categorical feature.
4. Gaps between consecutive distinct values.
5. Frequency of distinct values: if they repeat enough, it is a categorical
   feature.
6. Combined solution.

So, as many of you suggested, I guess we need to build a combined solution.

On Fri, Aug 14, 2015 at 10:29 AM, Thushan Ganegedara <[email protected]> wrote:

> Moreover, I think a hybrid approach as follows might work well.
>
> 1. Select a sample.
> 2. Filter columns by data type and find potential categorical variables
>    (integer / string).
> 3. Filter further by checking whether the same values are repeated multiple
>    times in the dataset.
>
> On Fri, Aug 14, 2015 at 2:53 PM, Nirmal Fernando <[email protected]> wrote:
>
>> Thanks for all the input.
>>
>> So let me summarise:
>>
>> *the problem*
>>
>> * We need to determine whether a feature is categorical or not, in order
>>   to draw certain graphs to explore a dataset before a user starts to
>>   build analyses (before user input).
>> * We can't get 100% accuracy, so what we produce is only a suggestion.
>> * The question is, what would be the most accurate method?
>>
>> *solutions*
>>
>> 1. Categorical threshold: if the number of distinct values is less than
>>    X, it is a categorical feature.
>> 2. Make all features with only integers (no decimals) categorical.
>> 3. Skewness: if the skewness of a feature's distribution is less than X,
>>    it is a categorical feature.
>> 4. Gaps between consecutive distinct values.
>> 5. Combined solution.
>>
>> On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <
>> [email protected]> wrote:
>>
>>> Another approach to distinguish between categorical and numerical
>>> features is as follows:
>>>
>>> First, we take the unique values from the column and sort them. If it's
>>> a categorical feature, then the gaps between the elements of this sorted
>>> list should be equal. In a numerical feature, this is extremely unlikely
>>> to happen. This behavior is valid in most scenarios, but there are a few
>>> exceptions as well, e.g. when a numerical ID is used as the categorical
>>> label: 19933, 19913, 18832, ...
>>>
>>> This is a very simple hack that can be easily implemented, but it is not
>>> a standard technique.
>>>
>>> WDYT?
>>>
>>> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <[email protected]>
>>> wrote:
>>>
>>>> I mean the current approach and skewness?
>>>>
>>>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <[email protected]>
>>>> wrote:
>>>>
>>>>> Can we use a combination of both?
>>>>>
>>>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> When a dataset is large, it is generally said to approximate a
>>>>>> Normal Distribution. :) True, it's hypothetical, but the point is
>>>>>> that when datasets are large, properties of a distribution such as
>>>>>> skewness and variance become closer to the properties of a Normal
>>>>>> Distribution in most cases.
>>>>>>
>>>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Supun,
>>>>>>>
>>>>>>> Thanks for the reply.
>>>>>>>
>>>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Nirmal,
>>>>>>>>
>>>>>>>> IMO I don't think we would be able to use skewness in this case.
>>>>>>>> Skewness says how symmetric the distribution is. For example, if
>>>>>>>> we consider a numerical/continuous feature (not categorical) which
>>>>>>>> is Normally Distributed, then the skewness would be 0. For a
>>>>>>>> categorical (encoded) feature having a symmetric distribution, the
>>>>>>>> skewness would again be 0.
>>>>>>>>
>>>>>>>
>>>>>>> What's the probability that you see a normal distribution in a real
>>>>>>> dataset? IMO it's very low, and since what we're doing here is only
>>>>>>> a suggestion, do you see it as an issue?
>>>>>>>
>>>>>>>> We did have this concern at the beginning as well, regarding how
>>>>>>>> we could determine whether a feature is categorical or continuous.
>>>>>>>> Usually this is strictly dependent on the domain of the dataset
>>>>>>>> (i.e. the user has to decide this with knowledge about the data).
>>>>>>>> That was the idea behind letting the user change the data type.
>>>>>>>> But since we needed a default option, we had to go for the
>>>>>>>> threshold, which was the only option we could come up with. I did
>>>>>>>> a bit of research on this too, but found no other solution :(
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Supun
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> We have a feature in ML where we suggest whether a given data
>>>>>>>>> column of a dataset is categorical or numerical. Currently, we
>>>>>>>>> determine this using a threshold value (the maximum number of
>>>>>>>>> categories that a non-string categorical feature can have; if it
>>>>>>>>> is exceeded, the feature is treated as a numerical feature). But
>>>>>>>>> this is not a successful measurement for most datasets.
>>>>>>>>>
>>>>>>>>> Can we use the 'skewness' of a distribution as a measurement to
>>>>>>>>> determine this? Can we say a column is numerical if the modulus
>>>>>>>>> of the skewness of the distribution is less than a certain
>>>>>>>>> threshold (say 0.01)?
>>>>>>>>>
>>>>>>>>> *References*:
>>>>>>>>>
>>>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & regards,
>>>>>>>>> Nirmal
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Supun Sethunga*
>>>>>>>> Software Engineer
>>>>>>>> WSO2, Inc.
>>>>>>>> http://wso2.com/
>>>>>>>> lean | enterprise | middleware
>>>>>>>> Mobile : +94 716546324
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks & regards,
>>>>>>> Nirmal
>>>>>>>
>>>>>
>>>>> --
>>>>> ============================
>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>> Site: http://people.apache.org/~hemapani/
>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>> Phone: 0772360902
>>>>>
>>>
>>> --
>>> Pruthuvi Maheshakya Wijewardena
>>> Software Engineer
>>> WSO2 : http://wso2.com/
>>> Email: [email protected]
>>> Mobile: +94711228855
>>>
>>
>> --
>> Thanks & regards,
>> Nirmal
>>
>
> --
> Regards,
>
> Thushan Ganegedara
> School of IT
> University of Sydney, Australia
>

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
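
P.S. To make the combined solution a bit more concrete, here is a rough Python sketch of how solutions 1, 2 and 5 might be chained, with the gap check (solution 4) kept as a separate helper because of the numeric-ID exception Maheshakya pointed out. The function names, the `max_distinct` and `min_mean_repeat` parameters and their default values are all assumptions made up for illustration; this is not what WSO2 ML currently does, and it only covers numeric columns.

```python
from collections import Counter
from typing import Sequence


def is_integer_valued(values: Sequence[float]) -> bool:
    """Solution 2: True when no value has a fractional part."""
    return all(float(v).is_integer() for v in values)


def has_equal_gaps(values: Sequence[float]) -> bool:
    """Solution 4: True when the sorted distinct values are evenly spaced."""
    distinct = sorted(set(values))
    gaps = {b - a for a, b in zip(distinct, distinct[1:])}
    return len(gaps) <= 1


def suggest_categorical(
    values: Sequence[float],
    max_distinct: int = 20,        # hypothetical threshold "X" from solution 1
    min_mean_repeat: float = 5.0,  # hypothetical "repeats enough" cut-off
) -> bool:
    """Suggest (not decide) whether a numeric column looks categorical."""
    if len(values) == 0:
        return False
    counts = Counter(values)
    # Solution 1: too many distinct values suggests a numerical feature.
    if len(counts) > max_distinct:
        return False
    # Solution 2: any decimal value suggests a numerical feature.
    if not is_integer_valued(values):
        return False
    # Solution 5: on average, each distinct value must repeat often enough.
    return len(values) / len(counts) >= min_mean_repeat


ratings = [1, 2, 3, 4, 5] * 20                      # encoded categories
heights = [150 + 0.37 * i for i in range(100)]      # continuous measurements
ids = [19933, 19913, 18832]                         # numeric IDs used as labels

print(suggest_categorical(ratings), has_equal_gaps(ratings))
print(suggest_categorical(heights))
print(suggest_categorical(ids), has_equal_gaps(ids))
```

With these made-up thresholds, `ratings` is suggested as categorical and `heights` is not. The `ids` column shows the limits of the approach: its gaps are unequal and nothing repeats in such a small sample, so it is not suggested as categorical, which is why the user still needs to be able to override the suggestion.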
_______________________________________________ Dev mailing list [email protected] http://wso2.org/cgi-bin/mailman/listinfo/dev
