[ 
https://issues.apache.org/jira/browse/MADLIB-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297035#comment-16297035
 ] 

Jingyi Mei edited comment on MADLIB-1167 at 12/19/17 4:23 PM:
--------------------------------------------------------------

For the format of the confidence interval:
PostgreSQL has built-in range types we can use 
(https://www.postgresql.org/docs/9.5/static/rangetypes.html). Do we have any 
requirement for numeric precision? `numrange` is a built-in type that lets us 
choose the precision. Otherwise, if we want to use float8 (which is more 
consistent with the other columns in the output table), we have to define our 
own range type, e.g. `floatrange`. 
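
To make the two options concrete, here is a rough sketch, assuming a 
PostgreSQL version with range type support (9.2 or later). The bound values 
are made up for illustration, and `floatrange` is just the name proposed 
above, not an existing type:

```sql
-- Option 1: built-in numrange. Precision is whatever the numeric bounds carry,
-- so we can round before constructing the range.
SELECT numrange(round(3.14159, 2), round(4.26535, 2), '()') AS ci;

-- Option 2: a user-defined range over float8 (hypothetical name floatrange).
-- float8mi is the built-in float8 subtraction function, used as subtype_diff.
CREATE TYPE floatrange AS RANGE (
    subtype = float8,
    subtype_diff = float8mi
);

SELECT floatrange(3.14159::float8, 4.26535::float8, '()') AS ci;
```

With option 2 the summary module would have to create and own that type at 
install time, which is the extra cost compared to `numrange`.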

Also, for a range type, we can choose (), [], (], or [) to indicate inclusive 
and exclusive bounds. For a CI, does '(a,b)' make more sense than '[a,b]'?
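
For reference, a small sketch of how the bound choice behaves for a continuous 
range such as `numrange` (values are arbitrary). The third constructor 
argument selects the bounds, and the choice affects whether the endpoints 
count as contained:

```sql
SELECT numrange(1.0, 2.0, '()') @> 1.0       AS open_contains_lower,    -- false
       numrange(1.0, 2.0, '[]') @> 1.0       AS closed_contains_lower,  -- true
       lower_inc(numrange(1.0, 2.0, '[)'))   AS lower_is_inclusive,     -- true
       upper_inc(numrange(1.0, 2.0, '[)'))   AS upper_is_inclusive;     -- false
```

Note that the two-argument constructor defaults to '[)', so we would need to 
pass the bounds argument explicitly whichever form we pick.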

Another way to implement the CI is to use a two-element array, which would 
give something like {a,b}, but I don't think this is the best option.
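
For comparison, a sketch of how the two representations would be read back 
(illustrative values only). The array relies on positional convention, while 
the range type has named accessors:

```sql
-- Array form: displays as {3.14,4.27}; element 1 is the lower bound by convention.
SELECT ci[1] AS ci_lower, ci[2] AS ci_upper
FROM (SELECT ARRAY[3.14, 4.27]::float8[] AS ci) AS t;

-- Range form: lower() and upper() extract the bounds explicitly.
SELECT lower(ci) AS ci_lower, upper(ci) AS ci_upper
FROM (SELECT numrange(3.14, 4.27, '()') AS ci) AS t;
```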


was (Author: jingyimei):
For the format of confidence interval:
There are postgres built-in range types we can use 
(https://www.postgresql.org/docs/9.5/static/rangetypes.html). Do we have any 
requirement for digits precision? `numrange` is a built-in type we can use 
which we can decide the digits precision, otherwise, if we want to use 
float8(this is more consistent with other columns in output table), we have to 
define our own db type as `floatrange`. 

Also, for a range type, we can choose (), [], (], [) to show the inclusive and 
exclusive bounds. For CI, does '(a,b)' make more sense than '[a,b]'?

Another way to implement CI is to just use a two-element array and we can get 
something like {a,b}, but I think this is not the best option.

> Summary - add more statistics
> -----------------------------
>
>                 Key: MADLIB-1167
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1167
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>             Fix For: v1.14
>
>
> In the summary function
> http://madlib.apache.org/docs/latest/group__grp__summary.html
> add additional statistics:
> 1) % positive values
> 2) % negative values
> 3) % zero values
> 4) confidence intervals (95% ?) on mean
> * does this make sense, since need to assume a distribution for the data 
> which we probably cannot infer?
> 5) Also please check why min and max are being reported for non-numeric cols. 
>  Is this a bug?
> {code}
> madlib=# SELECT * FROM houses_summary where target_column='zipcode';
> -[ RECORD 1 ]--------+----------------
> group_by             | 
> group_by_value       | 
> target_column        | zipcode
> column_number        | 8
> data_type            | text
> row_count            | 15
> distinct_values      | 2
> missing_values       | 0
> blank_values         | 0
> fraction_missing     | 0
> fraction_blank       | 0
> mean                 | 
> variance             | 
> min                  | 6
> max                  | 6
> first_quartile       | 
> median               | 
> third_quartile       | 
> most_frequent_values | {94301y,84301x}
> mfv_frequencies      | {10,5}
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
