Moreover, I think a hybrid approach as follows might work well.

1. Select a sample.
2. Filter columns by data type to find potential categorical variables (integer / string).
3. Filter further by checking whether the same values are repeated multiple times in the dataset.

On Fri, Aug 14, 2015 at 2:53 PM, Nirmal Fernando <nir...@wso2.com> wrote:

> Thanks for all the input.
>
> So let me summarise;
>
> *the problem*
>
> * We need to determine whether a feature is a categorical one or not, to
> draw certain graphs to explore a dataset, before a user starts to build
> analyses (before user input).
> * We can't get 100% accuracy, hence what we do is of course only a
> suggestion.
> * Question is, what would be the most accurate method.
>
> *solutions*
>
> 1. Categorical threshold: if the # of distinct values is less than X, it is
> a categorical feature.
> 2. Make all features with only integers (no decimals) categorical.
> 3. Skewness: if skewness of a distribution of a feature is less than X, it
> is a categorical feature.
> 4. Gaps between consecutive distinct values
> 5. Combined solution
>
> On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <
> mahesha...@wso2.com> wrote:
>
>> Another approach to distinguish between categorical and numerical
>> features can be elaborated as follows:
>>
>> First, we take out the unique values from the column and sort them. If
>> it's a categorical feature, then the gaps between the elements of this
>> sorted list should be equal. In a numerical feature, this is extremely
>> unlikely to happen. This behavior is valid in most scenarios, but there
>> are a few exceptions as well, e.g. when a numerical ID is used as the
>> categorical label - 19933, 19913, 18832, ...
>>
>> This is a very simple hack that can be easily implemented, but not a
>> standard technique.
>>
>> WDYT?
>>
>> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <srin...@wso2.com> wrote:
>>
>>> I mean current approach and skewness?
>>>
>>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <srin...@wso2.com>
>>> wrote:
>>>
>>>> Can we use a combination of both?
>>>>
>>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <sup...@wso2.com>
>>>> wrote:
>>>>
>>>>> When a dataset is large, in general it is said to approximate a
>>>>> Normal Distribution. :) True, it is hypothetical, but the point they
>>>>> make is that when datasets are large, properties of a distribution
>>>>> like skewness, variance, etc. become closer to the properties of a
>>>>> Normal Distribution in most cases.
>>>>>
>>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <nir...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Supun,
>>>>>>
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <sup...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Nirmal,
>>>>>>>
>>>>>>> IMO, I don't think we would be able to use skewness in this case.
>>>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>>>> consider a numerical/continuous feature (not categorical) which is
>>>>>>> Normally Distributed, then the skewness would be 0. Also, for a
>>>>>>> categorical (encoded) feature having a symmetric distribution, the
>>>>>>> skewness would again be 0.
>>>>>>>
>>>>>>
>>>>>> What's the probability of seeing a normal distribution in a real
>>>>>> dataset? IMO it's very low, and also, since what we're doing here is
>>>>>> a suggestion, do you see it as an issue?
>>>>>>
>>>>>>>
>>>>>>> We did have this concern at the beginning as well, regarding how we
>>>>>>> could determine whether a feature is categorical or continuous.
>>>>>>> Usually this is strictly dependent on the domain of the dataset
>>>>>>> (i.e. the user has to decide this with knowledge about the data).
>>>>>>> That was the idea behind letting the user change the data type.
>>>>>>> But since we needed a default option, we had to go for the
>>>>>>> threshold thing, which was the only option we could come up with.
>>>>>>> I did a bit of research on this too, but only to find no other
>>>>>>> solution :(
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Supun
>>>>>>>
>>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <nir...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We have a feature in ML where we suggest whether a given data
>>>>>>>> column of a dataset is categorical or numerical. Currently, we
>>>>>>>> determine this by using a threshold value (the maximum number of
>>>>>>>> categories that a non-string categorical feature can have; if it
>>>>>>>> is exceeded, the feature will be treated as a numerical feature).
>>>>>>>> But this is not a successful measure for most datasets.
>>>>>>>>
>>>>>>>> Can we use 'skewness' of a distribution as a measurement to
>>>>>>>> determine this? Can we say a column is numerical if the modulus
>>>>>>>> of the skewness of the distribution is less than a certain
>>>>>>>> threshold (say 0.01)?
>>>>>>>>
>>>>>>>> *References*:
>>>>>>>>
>>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Thanks & regards,
>>>>>>>> Nirmal
>>>>>>>>
>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>> Mobile: +94715779733
>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Supun Sethunga*
>>>>>>> Software Engineer
>>>>>>> WSO2, Inc.
>>>>>>> http://wso2.com/
>>>>>>> lean | enterprise | middleware
>>>>>>> Mobile : +94 716546324
>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Thanks & regards,
>>>>>> Nirmal
>>>>>>
>>>>>> Team Lead - WSO2 Machine Learner
>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>> Mobile: +94715779733
>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>
>>>>>
>>>>> --
>>>>> *Supun Sethunga*
>>>>> Software Engineer
>>>>> WSO2, Inc.
>>>>> http://wso2.com/
>>>>> lean | enterprise | middleware
>>>>> Mobile : +94 716546324
>>>>>
>>>>
>>>> --
>>>> ============================
>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>> Site: http://people.apache.org/~hemapani/
>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>> Phone: 0772360902
>>>>
>>>
>>> --
>>> ============================
>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>>>
>>
>> --
>> Pruthuvi Maheshakya Wijewardena
>> Software Engineer
>> WSO2 : http://wso2.com/
>> Email: mahesha...@wso2.com
>> Mobile: +94711228855
>>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
> _______________________________________________
> Dev mailing list
> Dev@wso2.org
> http://wso2.org/cgi-bin/mailman/listinfo/dev
>

--
Regards,

Thushan Ganegedara
School of IT
University of Sydney, Australia
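
P.S. To make the three-step hybrid idea at the top of this mail a bit more concrete, here is a rough Python sketch: sample the column, keep it only if its values are strings or whole numbers, then check that values actually repeat. This is only an illustration and not the WSO2 ML implementation; the function names and the `sample_size` and `min_mean_repeats` parameters (with their default values) are assumptions made up for this example and would need tuning on real datasets.

```python
import random
from collections import Counter


def sample_rows(column, sample_size=1000, seed=0):
    """Step 1: take a random sample (or the whole column if it is small)."""
    values = list(column)
    if len(values) <= sample_size:
        return values
    return random.Random(seed).sample(values, sample_size)


def is_candidate_type(values):
    """Step 2: accept only strings and whole numbers (no decimals)."""
    for v in values:
        if isinstance(v, (str, int)):
            continue
        if isinstance(v, float) and v.is_integer():
            continue
        return False
    return True


def has_repeated_values(values, min_mean_repeats=2.0):
    """Step 3: require each distinct value to occur several times on average.

    min_mean_repeats is a hypothetical threshold chosen for illustration.
    """
    if not values:
        return False
    distinct = len(Counter(values))
    return len(values) / distinct >= min_mean_repeats


def suggest_categorical(column, sample_size=1000, min_mean_repeats=2.0, seed=0):
    """Return True if the column looks categorical. This is only a suggestion."""
    sample = sample_rows(column, sample_size, seed)
    return is_candidate_type(sample) and has_repeated_values(sample, min_mean_repeats)
```

For example, `suggest_categorical(["a", "b", "a", "c", "b", "a"])` suggests categorical, while a column of unique integer IDs such as `list(range(100))` is rejected at step 3 because no value repeats. The user should still be able to override the suggestion, as discussed in the thread.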