Can we use a combination of both?

On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <sup...@wso2.com> wrote:
> When a dataset is large, it is generally said to approximate a Normal
> Distribution. :) True, it's hypothetical, but the point they make is that
> when datasets are large, properties of a distribution such as skewness,
> variance, etc. become closer to the properties of a Normal Distribution
> in most cases.
>
> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> Hi Supun,
>>
>> Thanks for the reply.
>>
>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <sup...@wso2.com> wrote:
>>
>>> Hi Nirmal,
>>>
>>> IMO, I don't think we would be able to use skewness in this case.
>>> Skewness says how symmetric the distribution is. For example, if we
>>> consider a numerical/continuous feature (not categorical) which is
>>> Normally Distributed, then the skewness would be 0. But for a
>>> categorical (encoded) feature having a symmetric distribution, the
>>> skewness would again be 0.
>>>
>>
>> What's the probability of seeing a normal distribution in a real
>> dataset? IMO it's very low, and since what we're doing here is only a
>> suggestion, do you see it as an issue?
>>
>>>
>>> We did have this concern at the beginning as well, regarding how we
>>> could determine whether a feature is categorical or continuous. Usually
>>> this is strictly dependent on the domain of the dataset (i.e. the user
>>> has to decide this with knowledge about the data). That was the idea
>>> behind letting the user change the data type. But since we needed a
>>> default option, we had to go for the threshold approach, which was the
>>> only option we could come up with. I did a bit of research on this too,
>>> but found no other solution :(
>>>
>>> Thanks,
>>> Supun
>>>
>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <nir...@wso2.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We have a feature in ML where we suggest whether a given data column
>>>> of a dataset is categorical or numerical. Currently, we determine this
>>>> by using a threshold value (the maximum number of categories that a
>>>> non-string categorical feature can have; if it is exceeded, the
>>>> feature is treated as a numerical feature). But this is not a
>>>> successful measurement for most datasets.
>>>>
>>>> Can we use the 'skewness' of a distribution as a measurement to
>>>> determine this? Can we say a column is numerical if the modulus of the
>>>> skewness of the distribution is less than a certain threshold (say
>>>> 0.01)?
>>>>
>>>> *References*:
>>>>
>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>
>>>> --
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>>
>>>> Team Lead - WSO2 Machine Learner
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>> Mobile: +94715779733
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>
>>> --
>>> *Supun Sethunga*
>>> Software Engineer
>>> WSO2, Inc.
>>> http://wso2.com/
>>> lean | enterprise | middleware
>>> Mobile : +94 716546324
>>>
>>
>

--
============================
Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
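
To make "a combination of both" concrete, here is a minimal sketch of one way the two heuristics from this thread could be combined. It is only an illustration, not how WSO2 ML actually does it: the function names, the `max_categories` default of 20 and the order in which the two checks are applied are all assumptions. Only the 0.01 skewness threshold comes from Nirmal's mail.

```python
def skewness(values):
    """Population skewness (third standardized moment) of a numeric column."""
    n = len(values)
    if n == 0:
        raise ValueError("skewness is undefined for an empty column")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    if m2 == 0:
        return 0.0  # constant column: treat as perfectly symmetric
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def suggest_type(values, max_categories=20, skew_threshold=0.01):
    """Suggest 'numerical' or 'categorical' for a non-string column.

    Hypothetical combination rule (an assumption, not from the thread):
      1. More distinct values than max_categories -> numerical
         (the existing threshold rule).
      2. Otherwise fall back to the proposed skewness rule:
         |skewness| below skew_threshold -> numerical, else categorical.
    """
    if len(set(values)) > max_categories:
        return "numerical"
    if abs(skewness(values)) < skew_threshold:
        return "numerical"
    return "categorical"
```

Note that Supun's objection still applies to step 2: a symmetric encoded categorical column such as [0, 1, 0, 1] has skewness 0 and would be suggested as numerical, so the result can only be a default that the user is able to override.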
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev