[ https://issues.apache.org/jira/browse/MADLIB-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297240#comment-16297240 ]
Frank McQuillan edited comment on MADLIB-1167 at 12/19/17 6:57 PM:
-------------------------------------------------------------------

(1) row count

Yes, I agree it is confusing and that we should change the name of the 1st one (composite return type) as you suggest for
{code}
`SELECT * FROM madlib.summary(valid_inputs)`
{code}

I would suggest that we change:
{code}
row_count INTEGER. The number of rows in the output table.
{code}
to:
{code}
num_col_summarized INTEGER. The number of columns from the source table that have been summarized.
{code}
Note that this will affect the user documentation and examples so they will need to be updated.

(2) CI format

For CI, float8 seems to be used elsewhere in the summary() function so that is fine to use. And I think regular {a,b} is fine. Indicating inclusive/exclusive bounds is generally not important when we are talking about a float that does not have fixed min/max.


was (Author: fmcquillan):
(1) row count

Yes, I agree it is confusing and that we should change the name of the 1st one (composite return type) as you suggest for `SELECT * FROM madlib.summary(valid_inputs)`

I would suggest that we change:
{code}
row_count INTEGER. The number of rows in the output table.
{code}
to:
{code}
num_col_summarized INTEGER. The number of columns from the source table that have been summarized.
{code}
Note that this will affect the user documentation and examples so they will need to be updated.

(2) CI format

For CI, float8 seems to be used elsewhere in the summary() function so that is fine to use. And I think regular {a,b} is fine. Indicating inclusive/exclusive bounds is generally not important when we are talking about a float that does not have fixed min/max.

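To make the {a,b} format from point (2) concrete, here is a minimal Python sketch of a confidence interval on the mean returned as a two-element float pair. It is not MADlib code: the function names `mean_confidence_interval` and `to_pg_array_literal` are hypothetical, and the normal approximation (z of about 1.96 for 95%) is an assumption that the ticket itself questions, since the distribution of the data is generally unknown.

```python
import math
import statistics

# Two-sided 95% quantile of the standard normal distribution.
# Using it assumes the sample mean is approximately normal (an assumption,
# not something the summary() function can verify for arbitrary columns).
Z_95 = 1.959963984540054


def mean_confidence_interval(values, z=Z_95):
    """Return [lower, upper] bounds for the mean (hypothetical helper)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the variance")
    mean = statistics.fmean(values)
    # Standard error of the mean from the sample standard deviation.
    std_err = statistics.stdev(values) / math.sqrt(n)
    return [mean - z * std_err, mean + z * std_err]


def to_pg_array_literal(bounds):
    """Render the pair in the plain {a,b} array style discussed above."""
    return "{" + ",".join(repr(float(b)) for b in bounds) + "}"


if __name__ == "__main__":
    ci = mean_confidence_interval([1.0, 2.0, 3.0, 4.0, 5.0])
    print(to_pg_array_literal(ci))
```

A plain two-element array carries no inclusive/exclusive markers, which matches the point that such markers matter little for an unbounded float.
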
> Summary - add more statistics
> -----------------------------
>
>                 Key: MADLIB-1167
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1167
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>             Fix For: v1.14
>
>
> In the summary function
> http://madlib.apache.org/docs/latest/group__grp__summary.html
> add additional statistics:
> 1) % positive values
> 2) % negative values
> 3) % zero values
> 4) confidence intervals (95% ?) on mean
> * does this make sense, since need to assume a distribution for the data which we probably cannot infer?
> 5) Also please check why min and max are being reported for non-numeric cols. Is this a bug?
> {code}
> madlib=# SELECT * FROM houses_summary where target_column='zipcode';
> -[ RECORD 1 ]--------+----------------
> group_by             |
> group_by_value       |
> target_column        | zipcode
> column_number        | 8
> data_type            | text
> row_count            | 15
> distinct_values      | 2
> missing_values       | 0
> blank_values         | 0
> fraction_missing     | 0
> fraction_blank       | 0
> mean                 |
> variance             |
> min                  | 6
> max                  | 6
> first_quartile       |
> median               |
> third_quartile       |
> most_frequent_values | {94301y,84301x}
> mfv_frequencies      | {10,5}
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)