[ https://issues.apache.org/jira/browse/MADLIB-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227685#comment-16227685 ]
Rahul Iyer commented on MADLIB-1167:
------------------------------------

For 4) we can report the CI for the mean using the z-score. This should be OK for most cases since this function will be run on big data (i.e. large sample). Hence the sample distribution should not be an issue.

The min-max values work on all numeric types and also on text types ('varchar', 'bpchar', 'text'). For the text types it returns the min and max length of the text values.

> Summary - add more statistics
> -----------------------------
>
>                 Key: MADLIB-1167
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1167
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>             Fix For: v2.0
>
>
> In the summary function
> http://madlib.apache.org/docs/latest/group__grp__summary.html
> add additional statistics:
> 1) % positive values
> 2) % negative values
> 3) % zero values
> 4) confidence intervals (95% ?) on mean
> * does this make sense, since need to assume a distribution for the data
> which we probably cannot infer?
> Also please check why min and max are being reported for non-numeric cols.
> Is this a bug?
> {code}
> madlib=# SELECT * FROM houses_summary where target_column='zipcode';
> -[ RECORD 1 ]--------+----------------
> group_by             |
> group_by_value       |
> target_column        | zipcode
> column_number        | 8
> data_type            | text
> row_count            | 15
> distinct_values      | 2
> missing_values       | 0
> blank_values         | 0
> fraction_missing     | 0
> fraction_blank       | 0
> mean                 |
> variance             |
> min                  | 6
> max                  | 6
> first_quartile       |
> median               |
> third_quartile       |
> most_frequent_values | {94301y,84301x}
> mfv_frequencies      | {10,5}
> {code}
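
A minimal sketch of the statistics discussed in this issue, written in plain Python rather than as MADlib code. It computes the percentage of positive, negative, and zero values and a z-score confidence interval on the mean. The function name `summarize`, the returned keys, the use of 1.96 as the 95% critical value, and the choice to compute percentages over non-null values only are all assumptions made for illustration; none of them is taken from the MADlib summary function.

```python
import math


def summarize(values, z=1.96):
    """Hypothetical helper (not part of MADlib).

    Returns sign percentages and a z-score confidence interval for the
    mean. z=1.96 corresponds to a 95% interval under the normal
    approximation, which relies on the sample being large.
    """
    # Assumption: NULLs (None) are excluded before any statistic is computed.
    xs = [v for v in values if v is not None]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two non-null values")

    mean = sum(xs) / n
    # Unbiased sample variance (divides by n - 1).
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    # Half-width of the interval: z * standard error of the mean.
    half_width = z * math.sqrt(variance / n)

    return {
        "positive_pct": 100.0 * sum(x > 0 for x in xs) / n,
        "negative_pct": 100.0 * sum(x < 0 for x in xs) / n,
        "zero_pct": 100.0 * sum(x == 0 for x in xs) / n,
        "mean": mean,
        "ci_lower": mean - half_width,
        "ci_upper": mean + half_width,
    }
```

The interval is mean ± z * sqrt(variance / n), so it narrows as the row count grows. That is why the normal approximation is a reasonable default for the large tables this function is expected to run on.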