[GitHub] [spark] maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference

GitBox Wed, 08 Apr 2020 18:31:26 -0700

maropu commented on a change in pull request #28120: [SPARK-31349][SQL][DOCS] Document built-in aggregate functions in SQL Reference
URL: https://github.com/apache/spark/pull/28120#discussion_r405906265


 ##########
 File path: docs/sql-ref-functions-builtin-aggregate.md
 ##########
 @@ -19,4 +19,657 @@ license: |
   limitations under the License.
 ---
 
-Aggregate functions
\ No newline at end of file
+Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions
+operate on a group of rows and return a single aggregated value.
+
+<table class="table">
+  <thead>
+    <tr><th style="width:25%">Function</th><th>Argument Type(s)</th><th>Description</th></tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td><b>{any | some | bool_or}</b>(<i>expression</i>)</td>
+      <td>boolean</td>
+      <td>Returns true if at least one value is true.</td>
+    </tr>
+    <tr>
+      <td><b>approx_count_distinct</b>(<i>expression[, relativeSD]</i>)</td>
+      <td>(long, double)</td>
+      <td>`relativeSD` is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.</td>
+    </tr>   
+    <tr>
+      <td><b>{avg | mean}</b>(<i>expression</i>)</td>
+      <td>short, float, byte, decimal, double, int, long or string</td>
+      <td>Returns the average of values in the input expression.</td> 
+    </tr>
+    <tr>
+      <td><b>{bool_and | every}</b>(<i>expression</i>)</td>
+      <td>boolean</td>
+      <td>Returns true if all values are true.</td>
+    </tr>
+    <tr>
+      <td><b>collect_list</b>(<i>expression</i>)</td>
+      <td>any</td>
+      <td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.</td>
+    </tr>       
+    <tr>
+      <td><b>collect_set</b>(<i>expression</i>)</td>
+      <td>any</td>
+      <td>Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.</td>
+    </tr>
+    <tr>
+      <td><b>corr</b>(<i>expression1, expression2</i>)</td>
+      <td>(double, double)</td>
+      <td>Returns Pearson coefficient of correlation between a set of number pairs.</td>
+    </tr>
+    <tr>
+      <td><b>count</b>([<b>DISTINCT</b>] <i>*</i>)</td>
+      <td>none</td>
+      <td>If specified <code>DISTINCT</code>, returns the total number of retrieved rows are unique and not null; Otherwise, returns the total number of retrieved rows, including rows containing null.</td>
+    </tr>
+    <tr>
+      <td><b>count</b>([<b>DISTINCT</b>] <i>expression1[, expression2</i>])</td>
+      <td>(any, any)</td>
+      <td>If specified <code>DISTINCT</code>, returns the number of rows for which the supplied expression(s) are unique and not null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.</td>
+    </tr>
+    <tr>
+      <td><b>count_if</b>(<i>predicate</i>)</td>
+      <td>expression that will be used for aggregation calculation</td>
+      <td>Returns the count number from the predicate evaluate to `TRUE` values.</td>
+    </tr> 
+    <tr>
+      <td><b>count_min_sketch</b>(<i>expression, eps, confidence, seed</i>)</td>
+      <td>(byte, short, int, long, string or binary, double,  double, integer)</td>
+      <td>`eps` and `confidence` are the double values between 0.0 and 1.0, `seed` is a positive integer. Returns a count-min sketch of a expression with the given `esp`, `confidence` and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.</td>
+    </tr>
+    <tr>
+      <td><b>covar_pop</b>(<i>expression1, expression2</i>)</td>
+      <td>(double, double)</td>
+      <td>Returns the population covariance of a set of number pairs.</td>
+    </tr> 
+    <tr>
+      <td><b>covar_samp</b>(<i>expression1, expression2</i>)</td>
+      <td>(double, double)</td>
+      <td>Returns the sample covariance of a set of number pairs.</td>
+    </tr>  
+    <tr>
+      <td><b>{first | first_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
+      <td>(any, boolean)</td>
+      <td>Returns the first value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td>
+    </tr>      
+    <tr>
+      <td><b>kurtosis</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the kurtosis value calculated from values of a group.</td>
+    </tr>    
+    <tr>
+      <td><b>{last | last_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
+      <td>(any, boolean)</td>
+      <td>Returns the last value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td>
+    </tr>      
+    <tr>
+      <td><b>max</b>(<i>expression</i>)</td>
+      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
+      <td>Returns the maximum value of the expression.</td>
+    </tr>          
+    <tr>
+      <td><b>max_by</b>(<i>expression1, expression2</i>)</td>
+      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
+      <td>Returns the value of expression1 associated with the maximum value of expression2.</td>
+    </tr>   
+    <tr>
+      <td><b>min</b>(<i>expression</i>)</td>
+      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
+      <td>Returns the minimum value of the expression.</td>
+    </tr>          
+    <tr>
+      <td><b>min_by</b>(<i>expression1, expression2</i>)</td>
+      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
+      <td>Returns the value of expression1 associated with the minimum value of expression2.</td>
+    </tr>      
+    <tr>
+      <td><b>percentile</b>(<i>expression, percentage [, frequency]</i>)</td>
+      <td>short, float, byte, decimal, double, int, or long, double, int</td>
+      <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.</td>
+    </tr>         
+    <tr>
+      <td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td>short, float, byte, decimal, double, int, or long, double, int</td>
+      <td>Percentage array is an array of number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).</td>
+    </tr>        
+    <tr>
+      <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
+      <td>short, float, byte, decimal, double, int, or long, double, int</td>
+      <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
+    </tr>    
+   <tr>
+      <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
+      <td>date or timestamp, double, int</td>
+      <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
+    </tr>                  
+    <tr>
+      <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td>short, float, byte, decimal, double, int, or long, double, int</td>
+      <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
+    </tr>             
+    <tr>
+      <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td>date or timestamp, double, int</td>
+      <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
+    </tr>             
+    <tr>
+      <td><b>skewness</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the skewness value calculated from values of a group.</td>
+    </tr>    
+    <tr>
+      <td><b>{stddev_samp | stddev | std}</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the sample standard deviation calculated from values of a group.</td>
+    </tr>  
+    <tr>
+      <td><b>stddev_pop</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the population standard deviation calculated from values of a group.</td>
+    </tr>
+    <tr>
+      <td><b>sum</b>(<i>expression</i>)</td>
+      <td>short, float, byte, decimal, double, int, or long</td>
+      <td>Returns the sum calculated from values of a group.</td>
+    </tr>       
+    <tr>
+      <td><b>{variance | var_samp}</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the sample variance calculated from values of a group.</td>
+    </tr>    
+    <tr>
+      <td><b>var_pop</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the population variance calculated from values of a group.</td>
+    </tr>        
+  </tbody>
+</table>
+
+### Examples
+{% highlight sql %}
+--base table 
+
+SELECT * FROM buildin_agg;
++----+----+----+-----+----+
+|  c1|  c2|  c3|   c4|  c5|
++----+----+----+-----+----+
+|   2|   3|agg4| true|true|
+|   1|   2|agg3|false|true|
+|   1|   1|agg1|false|true|
+|   4|   3|agg6|false|true|
+|   3|   3|agg5| true|true|
+|   1|   2|agg2|false|true|
+|   5|null|agg8|false|true|
+|null|   4|agg7|false|true|
++----+----+----+-----+----+
+
+-- any, some and bool_or
+
+SELECT ANY(c4) FROM buildin_agg;
++-------+
+|any(c4)|
++-------+
+|   true|
++-------+
+
+SELECT SOME(c4) FROM buildin_agg;
++-------+
+|any(c4)|
++-------+
+|   true|
++-------+
+
+SELECT BOOL_OR(c5) FROM buildin_agg;
++-----------+
+|bool_or(c5)|
++-----------+
+|       true|
++-----------+
+
+-- approx_count_distinct
+
+SELECT APPROX_COUNT_DISTINCT(c1) FROM buildin_agg;
++-------------------------+
+|approx_count_distinct(c1)|
++-------------------------+
+|                        5|
++-------------------------+
+
+SELECT APPROX_COUNT_DISTINCT(c1,0.39) FROM buildin_agg;
++-------------------------+
+|approx_count_distinct(c1)|
++-------------------------+
+|                        6|
++-------------------------+
+
+-- avg and mean
+
+SELECT AVG(c1) FROM buildin_agg;
++------------------+
+|           avg(c1)|
++------------------+
+|2.4285714285714284|
++------------------+
+
+SELECT MEAN(c1) FROM buildin_agg;
++------------------+
+|          mean(c1)|
++------------------+
+|2.4285714285714284|
++------------------+
+
+-- bool_and and every
+ 
+SELECT BOOL_AND(c4) FROM buildin_agg;
++------------+
+|bool_and(c4)|
++------------+
+|       false|
++------------+
+
+SELECT EVERY(c5) FROM buildin_agg;
++------------+
+|bool_and(c5)|
++------------+
+|        true|
++------------+
+
+--collect_list
+
+SELECT COLLECT_LIST(c2) FROM buildin_agg;
++---------------------+
+|collect_list(c2)     |
++---------------------+
+|[3, 2, 1, 3, 3, 2, 4]|
++---------------------+
+
+SELECT COLLECT_LIST(c4) FROM buildin_agg;
++------------------------------------------------------+
+|collect_list(c4)                                      |
 
 Review comment:
   Could you make the output right-aligned along with the others?
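
   To illustrate the layout being requested, here is a minimal sketch in plain Python that pads every cell on the left so the text sits against the right border, as in the `avg(c1)` and `bool_and(c4)` examples above. It is not Spark's own rendering code; the function name `render_right_aligned` is hypothetical and only exists for this example. The left-aligned output in the diff was likely produced by calling `show` with truncation disabled, but that is an assumption, not something stated in the PR.

```python
def render_right_aligned(header, rows):
    """Render a one-column ASCII table with right-aligned cells.

    `header` is the column name and `rows` is a list of already
    stringified cell values. Hypothetical helper for illustration only.
    """
    # The column is as wide as its longest cell (header included).
    width = max([len(header)] + [len(r) for r in rows])
    border = "+" + "-" * width + "+"

    lines = [border, "|" + header.rjust(width) + "|", border]
    # str.rjust pads on the left, which right-aligns the text.
    lines += ["|" + r.rjust(width) + "|" for r in rows]
    lines.append(border)
    return "\n".join(lines)


if __name__ == "__main__":
    # Values taken from the collect_list(c2) example in the diff.
    print(render_right_aligned("collect_list(c2)", ["[3, 2, 1, 3, 3, 2, 4]"]))
```

   With the `collect_list(c2)` values from the diff, the header cell comes out as `|     collect_list(c2)|`, which matches the alignment of the other examples.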

