[ 
https://issues.apache.org/jira/browse/MADLIB-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227685#comment-16227685
 ] 

Rahul Iyer commented on MADLIB-1167:
------------------------------------

For 4) we can report the CI for the mean using the z-score. This should be OK 
for most cases, since this function will be run on big data (i.e. large 
samples). With a large sample, the sampling distribution of the mean is 
approximately normal, so the distribution of the underlying data should not be 
an issue. 
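
As a rough illustration of the z-score approach, here is a minimal Python 
sketch. It is not MADlib code: the function name and the hard-coded 95% 
critical value are assumptions made for the example, and it computes the 
interval as mean +/- z * sqrt(sample variance / n).

```python
import math

# Two-sided 95% critical value of the standard normal (assumed confidence level).
Z_95 = 1.959963984540054


def mean_confidence_interval(values, z=Z_95):
    """Return (mean, lower, upper) for a z-based confidence interval on the mean.

    Hypothetical helper for illustration only; it relies on the large-sample
    normal approximation for the sampling distribution of the mean.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the variance")
    mean = sum(values) / n
    # Sample variance with the n - 1 denominator.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    # Half-width of the interval: z times the standard error of the mean.
    half_width = z * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width
```

In an in-database setting the same quantities would come from the count, 
mean and variance that the summary already computes, so no extra pass over 
the data should be needed.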

The min and max values are reported for all numeric types and also for text 
types ('varchar', 'bpchar', 'text'). For text types, they are the minimum and 
maximum length of the text values. 
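
The behaviour described above can be sketched in Python as follows. This is 
only an illustration under stated assumptions, not the MADlib implementation: 
the function name, the TEXT_TYPES set and the handling of missing values are 
hypothetical.

```python
# Type names treated as text, taken from the list in the comment above.
TEXT_TYPES = {"varchar", "bpchar", "text"}


def min_max(values, data_type):
    """Return (min, max) for a column; for text types, use value lengths.

    Hypothetical helper: None entries are treated as missing and skipped.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None, None
    if data_type in TEXT_TYPES:
        # For text columns, compare the lengths of the values.
        lengths = [len(v) for v in present]
        return min(lengths), max(lengths)
    return min(present), max(present)
```

For example, min_max(["94301y", "84301x"], "text") returns (6, 6), which is 
consistent with the min and max shown for the zipcode column in the issue 
description.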

> Summary - add more statistics
> -----------------------------
>
>                 Key: MADLIB-1167
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1167
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>             Fix For: v2.0
>
>
> In the summary function
> http://madlib.apache.org/docs/latest/group__grp__summary.html
> add additional statistics:
> 1) % positive values
> 2) % negative values
> 3) % zero values
> 4) confidence intervals (95% ?) on mean
> * does this make sense, since need to assume a distribution for the data 
> which we probably cannot infer?
> Also please check why min and max are being reported for non-numeric cols.  
> Is this a bug?
> {code}
> madlib=# SELECT * FROM houses_summary where target_column='zipcode';
> -[ RECORD 1 ]--------+----------------
> group_by             | 
> group_by_value       | 
> target_column        | zipcode
> column_number        | 8
> data_type            | text
> row_count            | 15
> distinct_values      | 2
> missing_values       | 0
> blank_values         | 0
> fraction_missing     | 0
> fraction_blank       | 0
> mean                 | 
> variance             | 
> min                  | 6
> max                  | 6
> first_quartile       | 
> median               | 
> third_quartile       | 
> most_frequent_values | {94301y,84301x}
> mfv_frequencies      | {10,5}
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
