[ https://issues.apache.org/jira/browse/MADLIB-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297035#comment-16297035 ]
Jingyi Mei edited comment on MADLIB-1167 at 12/19/17 4:23 PM:
--------------------------------------------------------------

For the format of the confidence interval:

PostgreSQL has built-in range types we can use (https://www.postgresql.org/docs/9.5/static/rangetypes.html). Do we have any requirement for digit precision? `numrange` is a built-in type for which we can decide the digit precision. Otherwise, if we want to use float8 (which is more consistent with the other columns in the output table), we have to define our own db type, `floatrange`.

Also, for a range type, we can choose (), [], (], or [) to show inclusive and exclusive bounds. For a CI, does '(a,b)' make more sense than '[a,b]'?

Another way to implement the CI is to use a two-element array, which gives something like {a,b}, but I think this is not the best option.
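To make the format options concrete, here is a rough Python sketch of how a CI on the mean might be computed and then rendered either range-style or array-style. It is only an illustration, not the MADlib implementation: the function names (`mean_confidence_interval`, `format_interval`, `format_array`), the fixed z value of 1.96 for 95%, the normal approximation itself, and the 4-digit default precision are all assumptions made for this example.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Return (low, high) for a normal-approximation CI on the mean.

    Assumption: the sample mean is treated as approximately normal,
    so z=1.96 corresponds to a 95% interval.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the spread")
    center = statistics.mean(values)
    # Standard error of the mean, based on the sample standard deviation.
    std_err = statistics.stdev(values) / math.sqrt(n)
    return center - z * std_err, center + z * std_err


def format_interval(low, high, bounds="()", digits=4):
    """Render the interval like a range literal, e.g. '(a,b)' or '[a,b]'."""
    if bounds not in ("()", "[]", "(]", "[)"):
        raise ValueError("bounds must be one of (), [], (], [)")
    return f"{bounds[0]}{low:.{digits}f},{high:.{digits}f}{bounds[1]}"


def format_array(low, high, digits=4):
    """Render the interval like a two-element array literal, e.g. '{a,b}'."""
    return "{" + f"{low:.{digits}f},{high:.{digits}f}" + "}"


low, high = mean_confidence_interval([1, 2, 3, 4, 5])
print(format_interval(low, high))        # (1.6141,4.3859)
print(format_interval(low, high, "[]"))  # [1.6141,4.3859]
print(format_array(low, high))           # {1.6141,4.3859}
```

The range-style output states whether each bound is inclusive, while the array-style output does not, which is one reason the array form may be the weaker option.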
> Summary - add more statistics
> -----------------------------
>
>                 Key: MADLIB-1167
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1167
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>             Fix For: v1.14
>
>
> In the summary function
> http://madlib.apache.org/docs/latest/group__grp__summary.html
> add additional statistics:
> 1) % positive values
> 2) % negative values
> 3) % zero values
> 4) confidence intervals (95% ?) on mean
> * does this make sense, since need to assume a distribution for the data
> which we probably cannot infer?
> 5) Also please check why min and max are being reported for non-numeric cols.
> Is this a bug?
> {code}
> madlib=# SELECT * FROM houses_summary where target_column='zipcode';
> -[ RECORD 1 ]--------+----------------
> group_by             |
> group_by_value       |
> target_column        | zipcode
> column_number        | 8
> data_type            | text
> row_count            | 15
> distinct_values      | 2
> missing_values       | 0
> blank_values         | 0
> fraction_missing     | 0
> fraction_blank       | 0
> mean                 |
> variance             |
> min                  | 6
> max                  | 6
> first_quartile       |
> median               |
> third_quartile       |
> most_frequent_values | {94301y,84301x}
> mfv_frequencies      | {10,5}
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)