Re: [HACKERS] custom function for converting human readable sizes to bytes

Pavel Stehule Mon, 25 Jan 2016 23:06:26 -0800

Hi

2016-01-26 6:25 GMT+01:00 Vitaly Burovoy <[email protected]>:


> Hello!
>
>
> Regression tests are present (see p.II).
>
> pg_proc.h has changed, so the CATALOG_VERSION_NO must be also changed.
>

this is not a part of patch - only a commiter knows CATALOG_VERSION_NO


>
> Code comments, error message string, DESCR in pg_proc.h and doc
> changes need proof-reading by a native English speaker, which I am
> not.
>
>
> ===
> The last patch fixes only part of my last notes, so I can only return
> it to the next round with the same notes.
>
> The list of my notes has not been done (all points listed below was
> mentioned in my last review[1]):
>
> 1. The documentation has duplicate parts of size units (in table and
> in the text). I think that one of them should be deleted (from the
> table).
>

fixed


>
> 2. Test has a note about sharing(using) size units via(from)
> memory_unit_conversion_table and a requirement to update the
> documentation when new units are supported, but the note will not be
> seen because there is no tests for that fail for units ("PB", "EB",
> "ZB") which will change in the future by the
> "memory_unit_conversion_table".
>

fixed


> 3. Code design is looked a little odd for me.
>
> 3a. There is an attempt to check a correctness of a numeric value
> instead of using numeric_in which does the same thing. In the last
> case the code looks more compact and nice (see the full explanation in
> the previous review[1] and both PoC there).
>
> Possibly that way is chosen to emit its own error messages ("invalid
> size:" instead of "invalid input syntax for type numeric:"), but it
> still leads errors like 'invalid unit: "+ kB"' for input value "+912+
> kB" (see "dbsize.out").
>

In tried to enhanced this point (I don't agree, so this is a bug)


>
> 3b. There is an extra copying from the buffer "str" to the "buffer"
> (it can be avoided and shown in an PoC "single_buffer.c").
>
> 3c. There is freeing buffers "str" and "buffer" that looks as a
> counter-design since MemoryContext frees them at once with other
> expired buffers (like "num" from numeric_in and numeric_mul; also
> "mul_num" from int8_numeric)
>

Numeric operations are on critical path for OLAP applications - so the code
is rewritten to minimize palloc/free operations. I don't expect usage
pg_size_bytes in OLAP critical path.


>
>
> ===
> While I was writing that text I found five additional notes. Both
> parts (above and below) are important (see concusion).
>
> I. The new version of the patch (pg-size-bytes-10.patch) has the
> simplest typo in an updated (from v9) row "inavalid size".
>

fixed

>
> II. There is no test for an empty input value, but it works as
> expected (an error 'invalid size: ""').
>

fixed

>
> III. There is no support of 'bytes' unit, it seems such behavior got
> majority approval[2].
>

No, because I have to use the supported units by configuration. The
configuration supports only three chars long units. Support for "bytes" was
removed, when I removed proprietary unit table.


>
> IV. Code design of the function "parse_memory_unit" is returning
> "bool" value and set "multiplier" via pointer. But while Tom Lane was
> reviewed my patch[3] in the similar case he changed code that now
> returns (see NonFiniteTimestampTzPart) not a flag, but a value (where
> zero still means an error). Be honest it looks better, but the current
> code is written like nearby functions.
>

It is a commiter decision. This is a implementation detail, and I don't see
any variant as significant better - using 0 as error flag isn't the best
too. But this is really detail.

>
> V. The documentation lacks a note that the base of the "pg_size_bytes"
> is 1024 whereas the base of the "pg_size_pretty" is 1000.
>

It isn't true, base for both are 1024. It was designed when special table
was used for pg_size_bytes. But when we share same control table, then is
wrong to use different table. The result can be optically different, but
semantically same.

postgres=# select pg_size_pretty(pg_size_bytes('1MB'));
┌────────────────┐
│ pg_size_pretty │
╞════════════════╡
│ 1024 kB        │
└────────────────┘
(1 row)

Time: 0.787 ms
postgres=# select pg_size_pretty(pg_size_bytes('1024MB'));
┌────────────────┐
│ pg_size_pretty │
╞════════════════╡
│ 1024 MB        │
└────────────────┘
(1 row)



>
> ===
> Concusion:
>
> I *can't* mark this patch as "Ready for Committer" with all previous
> notes, and I ask experienced hackers for a help.
> I'm not a strong Postgres hacker yet and sending to the next round
> with the same notes seems a nitpicking for me.
>
> So questions to the hackers:
>
> a. Are the rest of the notes from the previous review[1] (and listed
> in pp.1-3c) a nitpicking or should be done?
>
> b. Does the function in PoC "single_buffer.c" look more complex or easer?
>
> c. Is it worth for p.II to change error message like 'Empty input
> value is not a valid size for pg_size_bytes: "%s"'?
>

It is simply to implement, but It is looking like overengineering


>
> d. Should the function "pg_size_bytes" handle "bytes" as a size unit
> (see p.III)?
> There was no disagreement, but if I saw that thread before I became a
> reviewer and saw silent approval I would have argued. Moreover
> supporting of negative sizes was explained[4] by a symmetry for the
> "pg_size_pretty", which should include all size units produced by the
> "pg_size_pretty". On the other hand those functions can't be symmetric
> due to difference in bases (see p.V and [5])
>

negative values is fully supported.

support of "bytes" depends on support "bytes" unit by GUC. When "bytes"
unit will be supported, then it can be used in pg_size_bytes immediately.

updated patch attached

Regards

Pavel


>
> e. What way is preferred for p.IV?
>
> f. If p.V is actual, how to write it in the documentation (be honest
> I've no idea)? Should a description of the "pg_size_pretty" be changed
> too?
>
>
> [1]
> http://www.postgresql.org/message-id/CAKOSWNn=p8gx-p5y-b4t4dg-aahatup+crg41hq9beobzwx...@mail.gmail.com
> [2]
> http://www.postgresql.org/message-id/cacaco5qw7ffsffhkstjtycp4qf3oh9za14sc6z3dxx2yssj...@mail.gmail.com
> [3]
> http://git.postgresql.org/pg/commitdiff/647d87c56ab6da70adb753c08d7cdf7ee905ea8a
> [4]
> http://www.postgresql.org/message-id/ca+tgmozfomg4eyorzzgf7pzotg9pxpuhtqvxlfskim4izh8...@mail.gmail.com
> [5]
> http://www.postgresql.org/message-id/ca+tgmoygs_xixuqy+xyw9futtmrtbc_s0c1c0_wobwvbnf8...@mail.gmail.com
>

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 9c143b2..c170b78
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*************** postgres=# SELECT * FROM pg_xlogfile_nam
*** 17724,17729 ****
--- 17724,17732 ----
      <primary>pg_relation_size</primary>
     </indexterm>
     <indexterm>
+     <primary>pg_size_bytes</primary>
+    </indexterm>
+    <indexterm>
      <primary>pg_size_pretty</primary>
     </indexterm>
     <indexterm>
*************** postgres=# SELECT * FROM pg_xlogfile_nam
*** 17795,17800 ****
--- 17798,17812 ----
        </row>
        <row>
         <entry>
+         <literal><function>pg_size_bytes(<type>text</type>)</function></literal>
+         </entry>
+        <entry><type>bigint</type></entry>
+        <entry>
+          Converts a size in human-readable format with size units into bytes.
+        </entry>
+       </row>
+       <row>
+        <entry>
          <literal><function>pg_size_pretty(<type>bigint</type>)</function></literal>
          </entry>
         <entry><type>text</type></entry>
*************** postgres=# SELECT * FROM pg_xlogfile_nam
*** 17930,17935 ****
--- 17942,17955 ----
     </para>
  
     <para>
+     <function>pg_size_bytes</> can be used to get a size in bytes as
+     bigint from a human-readable format for a comparison purposes (it is
+     the opposite to <function>pg_size_pretty</> function). The format is
+     numeric with an optional size unit: kB, MB, GB or TB. The unit string
+     is case insensitive.
+    </para>
+ 
+    <para>
      The functions above that operate on tables or indexes accept a
      <type>regclass</> argument, which is simply the OID of the table or index
      in the <structname>pg_class</> system catalog.  You do not have to look up
diff --git a/src/backend/utils/adt/dbsize.c b/src/backend/utils/adt/dbsize.c
new file mode 100644
index 2084692..da8db86
*** a/src/backend/utils/adt/dbsize.c
--- b/src/backend/utils/adt/dbsize.c
***************
*** 25,30 ****
--- 25,31 ----
  #include "storage/fd.h"
  #include "utils/acl.h"
  #include "utils/builtins.h"
+ #include "utils/guc.h"
  #include "utils/numeric.h"
  #include "utils/rel.h"
  #include "utils/relfilenodemap.h"
*************** pg_size_pretty_numeric(PG_FUNCTION_ARGS)
*** 700,705 ****
--- 701,855 ----
  }
  
  /*
+  * Convert human readable size to long int.
+  *
+  * Due to support for decimal values and case insensitivity of units
+  * a function parse_int cannot be used.
+  */
+ Datum
+ pg_size_bytes(PG_FUNCTION_ARGS)
+ {
+ 	text	   *arg = PG_GETARG_TEXT_PP(0);
+ 	char	   *str,
+ 			   *strptr;
+ 	char	   *buffer,
+ 			   *bufptr;
+ 	Numeric		num;
+ 	int64		result;
+ 	bool		found_digit = false;
+ 	bool		found_point = false;
+ 
+ 	/* The content of str should be immutable, it is used in error messages. */
+ 	str = text_to_cstring(arg);
+ 	strptr = str;
+ 
+ 	/* prepare buffer for parts of parsed input string */
+ 	buffer = (char *) palloc(strlen(str) + 1);
+ 	bufptr = buffer;
+ 
+ 	/* Skip leading spaces */
+ 	while (isspace((unsigned char) *strptr))
+ 		strptr++;
+ 
+ 	switch (*strptr)
+ 	{
+ 		case '+':
+ 		case '-':
+ 			*bufptr++ = *strptr++;
+ 			break;
+ 	}
+ 
+ 	while(*strptr != '\0')
+ 	{
+ 		if (isdigit((unsigned char) *strptr))
+ 		{
+ 			*bufptr++ = *strptr++;
+ 			found_digit = true;
+ 		}
+ 		else if (*strptr == '.')
+ 		{
+ 			if (found_point)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 						 errmsg("invalid size: \"%s\"", str)));
+ 
+ 			*bufptr++ = *strptr++;
+ 			found_point = true;
+ 		}
+ 		else
+ 			/* any non digit char, the unit is starting */
+ 			break;
+ 	}
+ 
+ 	if (!found_digit)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 				 errmsg("invalid size: \"%s\"", str)));
+ 
+ 	*bufptr = '\0';
+ 
+ 	/* don't allow empty string */
+ 	if (*buffer == '\0')
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 					 errmsg("invalid size: \"%s\"", str)));
+ 
+ 	num = DatumGetNumeric(DirectFunctionCall3(numeric_in,
+ 							CStringGetDatum(buffer), 0, -1));
+ 
+ 	/* allow whitespace between number and unit */
+ 	while (isspace((unsigned char) *strptr))
+ 		strptr++;
+ 
+ 	/* Handle possible unit */
+ 	if (*strptr != '\0')
+ 	{
+ 		int				multiplier;
+ 		Numeric			mul_num;
+ 		const char	   *hintmsg;
+ 		const char	   *unitstr;
+ 
+ 		bufptr = buffer;
+ 
+ 		/*
+ 		 * unitstr holds complete string related to unit part. Used
+ 		 * as part of error message. The parse_memory_unit is called
+ 		 * with this value, when we detected invalid format, and we
+ 		 * would to emit error result to get hintstr. Otherwise
+ 		 * trimmed value is used.
+ 		 */
+ 		unitstr = strptr;
+ 
+ 		/* copy chars to buffer and stop on first space or end string */
+ 		while (*strptr != '\0' && !isspace((unsigned char) *strptr))
+ 		{
+ 			/* only alphabetic character are allowed */
+ 			if (!isalpha((unsigned char) *strptr))
+ 				ereport(ERROR,
+ 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 					 errmsg("invalid size: \"%s\"", str)));
+ 
+ 			*bufptr++ = *strptr++;
+ 		}
+ 
+ 		*bufptr = '\0';
+ 
+ 		/* We allow trailing spaces. Skip all spaces. */
+ 		while (isspace((unsigned char) *strptr))
+ 			strptr++;
+ 
+ 		/* Use buffer as unitstr, when we didn't find more words. */
+ 		if (*strptr == '\0')
+ 			unitstr = buffer;
+ 
+ 		/* Still, the unit can be invalid: too long, unknown */
+ 		if (!parse_memory_unit(unitstr, &multiplier, &hintmsg))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 					 errmsg("invalid unit: \"%s\"", unitstr),
+ 					 errhint("%s", _(hintmsg))));
+ 
+ 		/*
+ 		 * Now, the multiplier is in KB units. It should be multiplied by 1024
+ 		 * before usage.
+ 		 */
+ 		mul_num = DatumGetNumeric(DirectFunctionCall1(int8_numeric,
+ 						    Int64GetDatum(multiplier * 1024L)));
+ 
+ 		num = DatumGetNumeric(DirectFunctionCall2(numeric_mul,
+ 							    NumericGetDatum(mul_num),
+ 							    NumericGetDatum(num)));
+ 	}
+ 
+ 	result = DatumGetInt64(DirectFunctionCall1(numeric_int8, NumericGetDatum(num)));
+ 
+ 	pfree(buffer);
+ 	pfree(str);
+ 
+ 	PG_RETURN_INT64(result);
+ }
+ 
+ /*
   * Get the filenode of a relation
   *
   * This is expected to be used in queries like
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index 38ba82f..4020df7
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
*************** convert_from_base_unit(int64 base_value,
*** 5238,5243 ****
--- 5238,5272 ----
  
  
  /*
+  * Parse value as some known memory unit to their size in bytes.
+  * Used in pg_size_bytes function. Against convert_to_base_unit, a string
+  * comparation is case insensitive.
+  */
+ bool
+ parse_memory_unit(const char *unit, int *multiplier,
+ 					    const char **hintmsg)
+ {
+ 	int			i;
+ 
+ 	for (i = 0; *memory_unit_conversion_table[i].unit; i++)
+ 	{
+ 		const unit_conversion *conv = &memory_unit_conversion_table[i];
+ 
+ 		if (conv->base_unit == GUC_UNIT_KB &&
+ 			strcasecmp(unit, conv->unit) == 0)
+ 		{
+ 			*multiplier = conv->multiplier;
+ 			return true;
+ 		}
+ 	}
+ 
+ 	*hintmsg = memory_units_hint;
+ 
+ 	return false;
+ }
+ 
+ 
+ /*
   * Try to parse value as an integer.  The accepted formats are the
   * usual decimal, octal, or hexadecimal formats, optionally followed by
   * a unit name if "flags" indicates a unit is allowed.
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
new file mode 100644
index 79e92ff..5921bac
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
*************** DATA(insert OID = 2288 ( pg_size_pretty
*** 3587,3592 ****
--- 3587,3594 ----
  DESCR("convert a long int to a human readable text using size units");
  DATA(insert OID = 3166 ( pg_size_pretty			PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 25 "1700" _null_ _null_ _null_ _null_ _null_ pg_size_pretty_numeric _null_ _null_ _null_ ));
  DESCR("convert a numeric to a human readable text using size units");
+ DATA(insert OID = 3331 ( pg_size_bytes			PGNSP PGUID 12 1 0 0 0 f f f f t f i s 1 0 20 "25" _null_ _null_ _null_ _null_ _null_ pg_size_bytes _null_ _null_ _null_ ));
+ DESCR("convert a size in human-readable format with size units into bytes");
  DATA(insert OID = 2997 ( pg_table_size			PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 20 "2205" _null_ _null_ _null_ _null_ _null_ pg_table_size _null_ _null_ _null_ ));
  DESCR("disk space usage for the specified table, including TOAST, free space and visibility map");
  DATA(insert OID = 2998 ( pg_indexes_size		PGNSP PGUID 12 1 0 0 0 f f f f t f v s 1 0 20 "2205" _null_ _null_ _null_ _null_ _null_ pg_indexes_size _null_ _null_ _null_ ));
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
new file mode 100644
index c2e529f..bb21615
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
*************** extern Datum pg_relation_size(PG_FUNCTIO
*** 470,475 ****
--- 470,476 ----
  extern Datum pg_total_relation_size(PG_FUNCTION_ARGS);
  extern Datum pg_size_pretty(PG_FUNCTION_ARGS);
  extern Datum pg_size_pretty_numeric(PG_FUNCTION_ARGS);
+ extern Datum pg_size_bytes(PG_FUNCTION_ARGS);
  extern Datum pg_table_size(PG_FUNCTION_ARGS);
  extern Datum pg_indexes_size(PG_FUNCTION_ARGS);
  extern Datum pg_relation_filenode(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
new file mode 100644
index e1de1a5..3bfe0f4
*** a/src/include/utils/guc.h
--- b/src/include/utils/guc.h
*************** extern int	NewGUCNestLevel(void);
*** 357,362 ****
--- 357,364 ----
  extern void AtEOXact_GUC(bool isCommit, int nestLevel);
  extern void BeginReportingGUCOptions(void);
  extern void ParseLongOption(const char *string, char **name, char **value);
+ extern bool parse_memory_unit(const char *unit, int *multiplier,
+ 					    const char **hintmsg);
  extern bool parse_int(const char *value, int *result, int flags,
  		  const char **hintmsg);
  extern bool parse_real(const char *value, double *result);
diff --git a/src/test/regress/expected/dbsize.out b/src/test/regress/expected/dbsize.out
new file mode 100644
index aa513e7..e44a67c
*** a/src/test/regress/expected/dbsize.out
--- b/src/test/regress/expected/dbsize.out
***************
*** 1,3 ****
--- 1,6 ----
+ -- These functions shares memory_unit_conversion_table.
+ -- Currently only max TB unit is supported. When you increase
+ -- supported unit, update related documentation too.
  SELECT size, pg_size_pretty(size), pg_size_pretty(-1 * size) FROM
      (VALUES (10::bigint), (1000::bigint), (1000000::bigint),
              (1000000000::bigint), (1000000000000::bigint),
*************** SELECT size, pg_size_pretty(size), pg_si
*** 35,37 ****
--- 38,106 ----
   1000000000000000.5 | 909 TB         | -909 TB
  (12 rows)
  
+ SELECT pg_size_bytes(size) FROM
+    (VALUES('1'), ('1kB'), ('1MB'), (' 1 GB'), ('1.5 GB '),
+           ('1TB'), ('3000 TB')) x(size);
+   pg_size_bytes   
+ ------------------
+                 1
+              1024
+           1048576
+        1073741824
+        1610612736
+     1099511627776
+  3298534883328000
+ (7 rows)
+ 
+ -- case insensitive units are supported
+ SELECT pg_size_bytes(size) FROM
+    (VALUES('1'), ('1kb'), ('1mb'), (' 1 Gb'), ('1.5 gB '),
+           ('1tb')) x(size);
+  pg_size_bytes 
+ ---------------
+              1
+           1024
+        1048576
+     1073741824
+     1610612736
+  1099511627776
+ (6 rows)
+ 
+ -- negative numbers are supported
+ SELECT pg_size_bytes(size) FROM
+    (VALUES('-1'), ('-1kb'), ('-1mb'), (' -1 Gb'), ('-1.5 gB '),
+           ('-1tb')) x(size);
+  pg_size_bytes  
+ ----------------
+              -1
+           -1024
+        -1048576
+     -1073741824
+     -1610612736
+  -1099511627776
+ (6 rows)
+ 
+ --should fail
+ SELECT pg_size_bytes('1 AB');
+ ERROR:  invalid unit: "AB"
+ HINT:  Valid units for this parameter are "kB", "MB", "GB", and "TB".
+ SELECT pg_size_bytes('1 AB A');
+ ERROR:  invalid unit: "AB A"
+ HINT:  Valid units for this parameter are "kB", "MB", "GB", and "TB".
+ select pg_size_bytes('9223372036854775807.9');
+ ERROR:  bigint out of range
+ select pg_size_bytes('1024 bytes');
+ ERROR:  invalid unit: "bytes"
+ HINT:  Valid units for this parameter are "kB", "MB", "GB", and "TB".
+ select pg_size_bytes('');
+ ERROR:  invalid size: ""
+ select pg_size_bytes('.+912');
+ ERROR:  invalid size: ".+912"
+ select pg_size_bytes('+912+ kB');
+ ERROR:  invalid size: "+912+ kB"
+ select pg_size_bytes('++123 kB');
+ ERROR:  invalid size: "++123 kB"
+ --only TB and less are supported
+ SELECT pg_size_bytes('1PB');
+ ERROR:  invalid unit: "PB"
+ HINT:  Valid units for this parameter are "kB", "MB", "GB", and "TB".
diff --git a/src/test/regress/sql/dbsize.sql b/src/test/regress/sql/dbsize.sql
new file mode 100644
index c118090..5e47302
*** a/src/test/regress/sql/dbsize.sql
--- b/src/test/regress/sql/dbsize.sql
***************
*** 1,3 ****
--- 1,7 ----
+ -- These functions shares memory_unit_conversion_table.
+ -- Currently only max TB unit is supported. When you increase
+ -- supported unit, update related documentation too.
+ 
  SELECT size, pg_size_pretty(size), pg_size_pretty(-1 * size) FROM
      (VALUES (10::bigint), (1000::bigint), (1000000::bigint),
              (1000000000::bigint), (1000000000000::bigint),
*************** SELECT size, pg_size_pretty(size), pg_si
*** 10,12 ****
--- 14,44 ----
              (10.5::numeric), (1000.5::numeric), (1000000.5::numeric),
              (1000000000.5::numeric), (1000000000000.5::numeric),
              (1000000000000000.5::numeric)) x(size);
+ 
+ SELECT pg_size_bytes(size) FROM
+    (VALUES('1'), ('1kB'), ('1MB'), (' 1 GB'), ('1.5 GB '),
+           ('1TB'), ('3000 TB')) x(size);
+ 
+ -- case insensitive units are supported
+ SELECT pg_size_bytes(size) FROM
+    (VALUES('1'), ('1kb'), ('1mb'), (' 1 Gb'), ('1.5 gB '),
+           ('1tb')) x(size);
+ 
+ -- negative numbers are supported
+ SELECT pg_size_bytes(size) FROM
+    (VALUES('-1'), ('-1kb'), ('-1mb'), (' -1 Gb'), ('-1.5 gB '),
+           ('-1tb')) x(size);
+ 
+ --should fail
+ SELECT pg_size_bytes('1 AB');
+ SELECT pg_size_bytes('1 AB A');
+ select pg_size_bytes('9223372036854775807.9');
+ select pg_size_bytes('1024 bytes');
+ select pg_size_bytes('');
+ 
+ select pg_size_bytes('.+912');
+ select pg_size_bytes('+912+ kB');
+ select pg_size_bytes('++123 kB');
+ 
+ --only TB and less are supported
+ SELECT pg_size_bytes('1PB');

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] custom function for converting human readable sizes to bytes

Reply via email to