Re: [HACKERS] Built-in binning functions

Petr Jelinek Mon, 25 Aug 2014 07:17:00 -0700

Hi,

I finally had some time to get back to this.

I attached version3 of the patch which "fixes" Tom's complaint aboutint8 version by removing the int8 version as it does not seem necessary(the float8 can handle integers just fine).

This patch now basically has just one optimized function for float8 andone for numeric datatypes, just like width_bucket.

On 08/07/14 02:14, Tom Lane wrote:
Also, I'm not convinced by this business of throwing an error for a
NULL array element.  Per spec, null arguments to width_bucket()
produce a null result, not an error --- shouldn't this flavor act
similarly?  In any case I think the test needs to use
array_contains_nulls() not just ARR_HASNULL.

I really think there should be difference between NULL array and NULLinside array and that NULL inside the array is wrong input. I changedthe check to array_contains_nulls() though.

On 08/07/14 02:14, Tom Lane wrote:
Lastly, the spec defines behaviors for width_bucket that allow either
ascending or descending buckets.  We could possibly do something
similar

I am not sure it's worth it here as we require input to be sortedanyway. It might be worthwhile if we decided to do this as an aggregate(since there input would not be required to be presorted) instead offunction but I am not sure if that's desirable either as that wouldlimit usability to only the single main use-case (group by and count()).



On 20/07/14 11:01, Simon Riggs wrote:

On 16 July 2014 20:35, Pavel Stehule <pavel.steh...@gmail.com> wrote:

On 08/07/14 02:14, Tom Lane wrote:


I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.


I did mention in submission that the names are up for discussion, I am all
for naming it just width_bucket.


I had this idea too - but I am not sure if it is good idea. A distance
between ANSI SQL with_bucket and our enhancing is larger than in our
implementation of "median" for example.

I can live with both names, but current name I prefer.



So I suggest that we use the more generic function name bin(), with a
second choice of discretize()  (though that seems annoyingly easy to
spell incorrectly)

I really don't think bin() is good choice here, bucket has same meaningin this context and bin is often used as shorthand for binary whichwould be confusing here.

Anyway I currently left the name as it is, I would not be against usingwidth_bucket() or even just bucket(), not sure about the discretize()option.



--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c715ca2..e28ff68 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -920,6 +920,31 @@
        <entry><literal>width_bucket(5.35, 0.024, 10.06, 5)</literal></entry>
        <entry><literal>3</literal></entry>
       </row>
+
+      <row>
+       <entry>
+        <indexterm>
+         <primary>varwidth_bucket</primary>
+        </indexterm>
+        <literal><function>varwidth_bucket(<parameter>op</parameter> <type>numeric</type>, <parameter>thresholds</parameter> <type>numerc[]</type>)</function></literal></entry>
+       <entry><type>int</type></entry>
+       <entry>return the bucket to which <parameter>operand</> would
+       be assigned based on right-bound bucket <parameter>thresholds</>,
+       the <parameter>thresholds</> must be ordered (smallest first)</entry>
+       <entry><literal>varwidth_bucket(5.35::numeric, ARRAY[1, 3, 4, 6]::numeric[])</literal></entry>
+       <entry><literal>3</literal></entry>
+      </row>
+
+      <row>
+       <entry><literal><function>varwidth_bucket(<parameter>op</parameter> <type>dp</type>, <parameter>thresholds</parameter> <type>dp[]</type>)</function></literal></entry>
+       <entry><type>int</type></entry>
+       <entry>return the bucket to which <parameter>operand</> would
+       be assigned based on right-bound bucket <parameter>thresholds</>,
+       the <parameter>thresholds</> must be ordered (smallest first)</entry>
+       <entry><literal>varwidth_bucket(5.35::float8, ARRAY[1, 3, 4, 6]::float8[])</literal></entry>
+       <entry><literal>3</literal></entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 41b3eaa..fcbb623 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -2800,6 +2800,61 @@ width_bucket_float8(PG_FUNCTION_ARGS)
 	PG_RETURN_INT32(result);
 }
 
+/*
+ * Implements the float8 version of the varwidth_bucket() function.
+ * See also varwidth_bucket_general() and varwidth_bucket_int8().
+ *
+ * 'thresholds' is an array (must be sorted from smallest to biggest value)
+ * containing right-bounds for each "bucket", varwidth_bucket() returns
+ * integer indicating the bucket number that 'operand' belongs to. An operand
+ * greater than the upper bound is assigned to an additional bucket.
+ */
+Datum
+varwidth_bucket_float8(PG_FUNCTION_ARGS)
+{
+	float8		operand = PG_GETARG_FLOAT8(0);
+	ArrayType  *thresholds_in = PG_GETARG_ARRAYTYPE_P(1);
+	float8	   *thresholds;
+	int32		left;
+	int32		mid;
+	int32		right;
+
+	/* Verify input */
+	if (ARR_NDIM(thresholds_in) != 1)
+		ereport(ERROR,
+			(errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
+			 errmsg("thresholds must be one dimensional array ")));
+
+	if (array_contains_nulls(thresholds_in))
+		ereport(ERROR,
+			(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+			 errmsg("thresholds array must not contain NULLs")));
+
+	/* Make sure val and thresholds use same types so we can be sure they can be compared. */
+	if (ARR_ELEMTYPE(thresholds_in) != FLOAT8OID)
+		ereport(ERROR,
+		(errcode(ERRCODE_ARRAY_ELEMENT_ERROR),
+		 errmsg("thresholds array must be type float8[]")));
+
+	thresholds = (float8 *) ARR_DATA_PTR(thresholds_in);
+
+	/* Find the bucket */
+	left = 0;
+	right = ArrayGetNItems(ARR_NDIM(thresholds_in), ARR_DIMS(thresholds_in));
+	while (left < right) {
+		mid = (left + right) / 2;
+		if (operand < thresholds[mid])
+			right = mid;
+		else
+			left = mid + 1;
+	}
+
+	/* Avoid leaking memory when handed toasted input. */
+	PG_FREE_IF_COPY(thresholds_in, 1);
+
+	PG_RETURN_INT32(left);
+}
+
 /* ========== PRIVATE ROUTINES ========== */
 
 #ifndef HAVE_CBRT
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 19d0bdc..71b3363 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -35,6 +35,7 @@
 #include "utils/builtins.h"
 #include "utils/int8.h"
 #include "utils/numeric.h"
+#include "utils/typcache.h"
 
 /* ----------
  * Uncomment the following to enable compilation of dump_numeric()
@@ -1344,6 +1345,97 @@ compute_bucket(Numeric operand, Numeric bound1, Numeric bound2,
 	free_var(&operand_var);
 }
 
+
+/*
+ * Implements the numeric version of the varwidth_bucket() function.
+ * See also varwidth_bucket_float8().
+ */
+Datum
+varwidth_bucket_numeric(PG_FUNCTION_ARGS)
+{
+	Numeric		operand = PG_GETARG_NUMERIC(0);
+	ArrayType  *thresholds_in = PG_GETARG_ARRAYTYPE_P(1);
+	char	   *thresholds;
+	Oid			collation = PG_GET_COLLATION();
+	TypeCacheEntry *typentry;
+	int32		left;
+	int32		mid;
+	int32		right;
+	int			typlen;
+	bool		typbyval;
+	char		typalign;
+
+	/* Verify input */
+	if (ARR_NDIM(thresholds_in) != 1)
+		ereport(ERROR,
+			(errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
+			 errmsg("thresholds must be one dimensional array ")));
+
+	if (array_contains_nulls(thresholds_in))
+		ereport(ERROR,
+			(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+			 errmsg("thresholds array must not contain NULLs")));
+
+	/* Make sure val and thresholds use same types so we can be sure they can be compared. */
+	if (ARR_ELEMTYPE(thresholds_in) != NUMERICOID)
+		ereport(ERROR,
+				(errcode(ERRCODE_DATATYPE_MISMATCH),
+				 errmsg("thresholds array must be of type numeric[]")));
+
+	/* Fetch information about the input type */
+	typentry = (TypeCacheEntry *) fcinfo->flinfo->fn_extra;
+	if (typentry == NULL ||
+		typentry->type_id != NUMERICOID)
+	{
+		typentry = lookup_type_cache(NUMERICOID,
+									 TYPECACHE_TUPDESC);
+		fcinfo->flinfo->fn_extra = (void *) typentry;
+	}
+	typlen = typentry->typlen;
+	typbyval = typentry->typbyval;
+	typalign = typentry->typalign;
+
+	/* Find the bucket */
+	left = 0;
+	right = ArrayGetNItems(ARR_NDIM(thresholds_in), ARR_DIMS(thresholds_in));
+	thresholds = ARR_DATA_PTR(thresholds_in);
+
+	while (left < right) {
+		int		i;
+		Numeric		val;
+		char *ptr = thresholds;
+
+		mid = (left + right) / 2;
+		for (i = left; i < mid; i++)
+		{
+			ptr = att_addlength_pointer(ptr, typlen, ptr);
+			ptr = (char *) att_align_nominal(ptr, typalign);
+		}
+		val = DatumGetNumeric(fetch_att(ptr, typbyval, typlen));
+
+		if (cmp_numerics(operand, val) < 0)
+		{
+			right = mid;
+		}
+		else
+		{
+			/*
+			 * Move the thresholds pointer so we don't have to seek from
+			 * the beginning of array again.
+			 * */
+			thresholds = att_addlength_pointer(ptr, typlen, ptr);
+			thresholds = (char *) att_align_nominal(thresholds, typalign);
+			left = mid + 1;
+		}
+	}
+
+	/* Avoid leaking memory when handed toasted input. */
+	PG_FREE_IF_COPY(thresholds_in, 1);
+
+	PG_RETURN_INT32(left);
+}
+
+
 /* ----------------------------------------------------------------------
  *
  * Comparison functions
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index a84595e..2a4d993 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -515,6 +515,8 @@ DATA(insert OID = 309 (  float84gt		   PGNSP PGUID 12 1 0 0 0 f f f t t f i 2 0
 DATA(insert OID = 310 (  float84ge		   PGNSP PGUID 12 1 0 0 0 f f f t t f i 2 0 16 "701 700" _null_ _null_ _null_ _null_ float84ge _null_ _null_ _null_ ));
 DATA(insert OID = 320 ( width_bucket	   PGNSP PGUID 12 1 0 0 0 f f f f t f i 4 0 23 "701 701 701 23" _null_ _null_ _null_ _null_ width_bucket_float8 _null_ _null_ _null_ ));
 DESCR("bucket number of operand in equidepth histogram");
+DATA(insert OID = 3256 ( varwidth_bucket   PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "701 1022" _null_ _null_ _null_ _null_ varwidth_bucket_float8 _null_ _null_ _null_ ));
+DESCR("bucket number of operand in the set of right-bound buckets");
 
 DATA(insert OID = 311 (  float8			   PGNSP PGUID 12 1 0 0 0 f f f f t f i 1 0 701 "700" _null_ _null_ _null_ _null_	ftod _null_ _null_ _null_ ));
 DESCR("convert float4 to float8");
@@ -2298,6 +2300,8 @@ DATA(insert OID = 1980 ( numeric_div_trunc		PGNSP PGUID 12 1 0 0 0 f f f f t f i
 DESCR("trunc(x/y)");
 DATA(insert OID = 2170 ( width_bucket			PGNSP PGUID 12 1 0 0 0 f f f f t f i 4 0 23 "1700 1700 1700 23" _null_ _null_ _null_ _null_ width_bucket_numeric _null_ _null_ _null_ ));
 DESCR("bucket number of operand in equidepth histogram");
+DATA(insert OID = 3257 ( varwidth_bucket		PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 23 "1700 1231" _null_ _null_ _null_ _null_ varwidth_bucket_numeric _null_ _null_ _null_ ));
+DESCR("bucket number of operand in the set of right-bound buckets");
 
 DATA(insert OID = 1747 ( time_pl_interval		PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 1083 "1083 1186" _null_ _null_ _null_ _null_	time_pl_interval _null_ _null_ _null_ ));
 DATA(insert OID = 1748 ( time_mi_interval		PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 1083 "1083 1186" _null_ _null_ _null_ _null_	time_mi_interval _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index b0a4748..2283baf 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -452,6 +452,7 @@ extern Datum float84le(PG_FUNCTION_ARGS);
 extern Datum float84gt(PG_FUNCTION_ARGS);
 extern Datum float84ge(PG_FUNCTION_ARGS);
 extern Datum width_bucket_float8(PG_FUNCTION_ARGS);
+extern Datum varwidth_bucket_float8(PG_FUNCTION_ARGS);
 
 /* dbsize.c */
 extern Datum pg_tablespace_size_oid(PG_FUNCTION_ARGS);
@@ -1036,6 +1037,7 @@ extern Datum int4_avg_accum_inv(PG_FUNCTION_ARGS);
 extern Datum int8_avg(PG_FUNCTION_ARGS);
 extern Datum int2int4_sum(PG_FUNCTION_ARGS);
 extern Datum width_bucket_numeric(PG_FUNCTION_ARGS);
+extern Datum varwidth_bucket_numeric(PG_FUNCTION_ARGS);
 extern Datum hash_numeric(PG_FUNCTION_ARGS);
 
 /* ri_triggers.c */
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index 4286691..391abf1 100644
--- a/src/test/regress/expected/arrays.out
+++ b/src/test/regress/expected/arrays.out
@@ -1706,3 +1706,51 @@ select length(md5((f1[1]).c2)) from dest;
 
 drop table dest;
 drop type textandtext;
+-- Testing for varwidth_bucket(). For convenience, we test both the
+-- numeric and float8.
+-- errors
+SELECT varwidth_bucket(5.0::numeric, ARRAY[3, 4, NULL]);
+ERROR:  thresholds array must not contain NULLs
+SELECT varwidth_bucket(5.0::numeric, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);
+ERROR:  thresholds must be one dimensional array 
+SELECT varwidth_bucket(5.0::float8, ARRAY[3, 4, NULL]);
+ERROR:  thresholds array must not contain NULLs
+SELECT varwidth_bucket(5.0::float8, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);
+ERROR:  thresholds must be one dimensional array 
+-- normal operation
+CREATE TABLE varwidth_bucket_test (operand_num numeric, operand_f8 float8);
+COPY varwidth_bucket_test (operand_num) FROM stdin;
+UPDATE varwidth_bucket_test SET operand_f8 = operand_num::float8;
+SELECT
+    operand_num,
+    varwidth_bucket(operand_num, ARRAY[1, 3, 5, 10]) AS wb_1,
+    varwidth_bucket(operand_f8, ARRAY[1, 3, 5, 10]) AS wb_1f,
+    varwidth_bucket(operand_num, ARRAY[0, 5.5, 9.99]) AS wb_2,
+    varwidth_bucket(operand_f8, ARRAY[0, 5.5, 9.99]) AS wb_2f,
+    varwidth_bucket(operand_num, ARRAY[-6, -5, 2]) AS wb_3,
+    varwidth_bucket(operand_f8, ARRAY[-6, -5, 2]) AS wb_3f
+    FROM varwidth_bucket_test;
+   operand_num    | wb_1 | wb_1f | wb_2 | wb_2f | wb_3 | wb_3f 
+------------------+------+-------+------+-------+------+-------
+             -5.2 |    0 |     0 |    0 |     0 |    1 |     1
+    -0.0000000001 |    0 |     0 |    0 |     0 |    2 |     2
+   0.000000000001 |    0 |     0 |    1 |     1 |    2 |     2
+                1 |    1 |     1 |    1 |     1 |    2 |     2
+ 1.99999999999999 |    1 |     1 |    1 |     1 |    2 |     2
+                2 |    1 |     1 |    1 |     1 |    3 |     3
+ 2.00000000000001 |    1 |     1 |    1 |     1 |    3 |     3
+                3 |    2 |     2 |    1 |     1 |    3 |     3
+                4 |    2 |     2 |    1 |     1 |    3 |     3
+              4.5 |    2 |     2 |    1 |     1 |    3 |     3
+                5 |    3 |     3 |    1 |     1 |    3 |     3
+              5.5 |    3 |     3 |    2 |     2 |    3 |     3
+                6 |    3 |     3 |    2 |     2 |    3 |     3
+                7 |    3 |     3 |    2 |     2 |    3 |     3
+                8 |    3 |     3 |    2 |     2 |    3 |     3
+                9 |    3 |     3 |    2 |     2 |    3 |     3
+ 9.99999999999999 |    3 |     3 |    3 |     3 |    3 |     3
+               10 |    4 |     4 |    3 |     3 |    3 |     3
+ 10.0000000000001 |    4 |     4 |    3 |     3 |    3 |     3
+(19 rows)
+
+DROP TABLE varwidth_bucket_test;
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index d9f7cbf..d6221c6 100644
--- a/src/test/regress/sql/arrays.sql
+++ b/src/test/regress/sql/arrays.sql
@@ -476,3 +476,51 @@ drop table src;
 select length(md5((f1[1]).c2)) from dest;
 drop table dest;
 drop type textandtext;
+
+-- Testing for varwidth_bucket(). For convenience, we test both the
+-- numeric and float8.
+
+-- errors
+SELECT varwidth_bucket(5.0::numeric, ARRAY[3, 4, NULL]);
+SELECT varwidth_bucket(5.0::numeric, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);
+SELECT varwidth_bucket(5.0::float8, ARRAY[3, 4, NULL]);
+SELECT varwidth_bucket(5.0::float8, ARRAY[ARRAY[1, 2], ARRAY[3, 4]]);
+
+-- normal operation
+CREATE TABLE varwidth_bucket_test (operand_num numeric, operand_f8 float8);
+
+COPY varwidth_bucket_test (operand_num) FROM stdin;
+-5.2
+-0.0000000001
+0.000000000001
+1
+1.99999999999999
+2
+2.00000000000001
+3
+4
+4.5
+5
+5.5
+6
+7
+8
+9
+9.99999999999999
+10
+10.0000000000001
+\.
+
+UPDATE varwidth_bucket_test SET operand_f8 = operand_num::float8;
+
+SELECT
+    operand_num,
+    varwidth_bucket(operand_num, ARRAY[1, 3, 5, 10]) AS wb_1,
+    varwidth_bucket(operand_f8, ARRAY[1, 3, 5, 10]) AS wb_1f,
+    varwidth_bucket(operand_num, ARRAY[0, 5.5, 9.99]) AS wb_2,
+    varwidth_bucket(operand_f8, ARRAY[0, 5.5, 9.99]) AS wb_2f,
+    varwidth_bucket(operand_num, ARRAY[-6, -5, 2]) AS wb_3,
+    varwidth_bucket(operand_f8, ARRAY[-6, -5, 2]) AS wb_3f
+    FROM varwidth_bucket_test;
+
+DROP TABLE varwidth_bucket_test;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Built-in binning functions

Reply via email to