[ 
https://issues.apache.org/jira/browse/MADLIB-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297240#comment-16297240
 ] 

Frank McQuillan edited comment on MADLIB-1167 at 12/19/17 6:57 PM:
-------------------------------------------------------------------

(1) row count

Yes, I agree it is confusing and that we should change the name of the first one 
(the composite return type), as you suggest, for
{code}
SELECT * FROM madlib.summary(valid_inputs)
{code}

I would suggest that we change: 

{code}
row_count       INTEGER. The number of rows in the output table.
{code}
to:
{code}
num_col_summarized        INTEGER. The number of columns from the source table 
that have been summarized.
{code}
Note that this will affect the user documentation and examples so they will 
need to be updated.


(2) CI format

For CI, float8 seems to be used elsewhere in the summary() function so that is 
fine to use.

And I think the regular {a,b} format is fine. Indicating inclusive/exclusive 
bounds is generally not important when we are talking about a float that does 
not have a fixed min/max.
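
To make the {a,b} format concrete, here is a rough Python sketch of a 95% confidence interval on the mean, returned as a two-element list of floats [lower, upper]. This is only an illustration, not how summary() computes it: the function name mean_confidence_interval is hypothetical, and the normal approximation with z = 1.96 is an assumption (which ties back to the open question in the issue about having to assume a distribution).

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Return [lower, upper] for the mean using a normal approximation.

    Hypothetical helper for illustration only. z=1.96 corresponds to a
    95% interval under the normality assumption.
    """
    # Skip NULL-like entries, as a summary over a nullable column would.
    data = [float(v) for v in values if v is not None]
    n = len(data)
    if n < 2:
        return None  # sample standard deviation is undefined below two values
    mean = statistics.fmean(data)
    stderr = statistics.stdev(data) / math.sqrt(n)
    return [mean - z * stderr, mean + z * stderr]


ci = mean_confidence_interval([2.0, 4.0, 4.0, 5.0, 7.0, None])
print(ci)
```

Both bounds are plain floats, so a float8 array with two entries would be enough to carry the result.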


> Summary - add more statistics
> -----------------------------
>
>                 Key: MADLIB-1167
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1167
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>            Assignee: Jingyi Mei
>             Fix For: v1.14
>
>
> In the summary function
> http://madlib.apache.org/docs/latest/group__grp__summary.html
> add additional statistics:
> 1) % positive values
> 2) % negative values
> 3) % zero values
> 4) confidence intervals (95% ?) on mean
> * does this make sense, since need to assume a distribution for the data 
> which we probably cannot infer?
> 5) Also please check why min and max are being reported for non-numeric cols. 
>  Is this a bug?
> {code}
> madlib=# SELECT * FROM houses_summary where target_column='zipcode';
> -[ RECORD 1 ]--------+----------------
> group_by             | 
> group_by_value       | 
> target_column        | zipcode
> column_number        | 8
> data_type            | text
> row_count            | 15
> distinct_values      | 2
> missing_values       | 0
> blank_values         | 0
> fraction_missing     | 0
> fraction_blank       | 0
> mean                 | 
> variance             | 
> min                  | 6
> max                  | 6
> first_quartile       | 
> median               | 
> third_quartile       | 
> most_frequent_values | {94301y,84301x}
> mfv_frequencies      | {10,5}
> {code}
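
For items 1) to 3) in the description above, the requested statistics could be sketched in Python roughly as follows. The function name sign_fractions is hypothetical, and using only non-NULL, non-NaN values as the denominator is an assumption; the issue does not say which denominator is intended.

```python
import math


def sign_fractions(values):
    """Return the fractions of positive, negative and zero values.

    Hypothetical helper for illustration only. NULL-like (None) and NaN
    entries are excluded from both the counts and the denominator.
    """
    data = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    n = len(data)
    if n == 0:
        return None  # nothing to summarize
    positive = sum(1 for v in data if v > 0)
    negative = sum(1 for v in data if v < 0)
    zero = n - positive - negative
    return {
        "positive": positive / n,
        "negative": negative / n,
        "zero": zero / n,
    }


fractions = sign_fractions([3, -1, 0, 0, None, 2.5])
print(fractions)
```

Multiplying each fraction by 100 gives the percentages asked for in the issue.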



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
