Re: [PATCHES] [HACKERS] writing new regexp functions

Jeremy Drake Sun, 04 Feb 2007 13:00:30 -0800

On Sun, 4 Feb 2007, David Fetter wrote:

> On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote:
>
> > Let me know if you see any bugs or issues with this code, and I am
> > open to suggestions for further regression tests ;)
>
> > Things that I still want to look into:
> > * regexp flags (a la regexp_replace).
>
> One more text field at the end is how the regexp_replace() one does
> it.


That's how I did it.

> > * maybe make regexp_matches return setof whatever, if given a 'g' flag
> >   return all matches in string.
>
> This is doable with current machinery, albeit a little clumsily.

I have implemented this too.

> > * maybe a join function that works as an aggregate
> >    SELECT join(',', col) FROM tbl
> >   currently can be written as
> >    SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')
>
> The array_accum() aggregate in the docs works OK for this purpose.

I have not tackled this yet; I think it may be better to stick with the
ARRAY() construct for now.


So, here is the new version of the code, and also a new version of the
patch to core, which fixes some compile warnings that I did not see at
first because I was using ICC rather than GCC.

Here is the README.regexp_ext from the tar file:


This package contains regexp functions beyond those currently provided
in core PostgreSQL, utilizing the regexp engine built into core.  This
is still a work-in-progress.

The most recent version of this code can be found at
 http://www.jdrake.com/postgresql/regexp/regexp_ext.tar.gz
and the prerequisite patch to PostgreSQL core, which has been submitted
for review, can be found at
 http://www.jdrake.com/postgresql/regexp/regexp-export.patch

The .tar.gz file expects to be untarred in contrib/.  I have made some
regression tests that can be run using 'make installcheck' as normal for
contrib.  I think they exercise the corner cases in the code, but I may
very well have missed some.  The module requires the above-mentioned patch
to core in order to compile, because it uses functions newly exported from
src/backend/utils/adt/regexp.c.

Let me know if you see any bugs or issues with this code, and I am open to
suggestions for further regression tests ;)

Functions implemented in this module:
* regexp_split(str text, pattern text) RETURNS SETOF text
  regexp_split(str text, pattern text, flags text) RETURNS SETOF text
   returns each section of the string delimited by the pattern.
* regexp_matches(str text, pattern text) RETURNS text[]
   returns an array containing all capture groups from matching pattern
   against str.
* regexp_matches(str text, pattern text, flags text) RETURNS SETOF
    (prematch text, fullmatch text, matches text[], postmatch text)
   returns all capture groups in the matches array, and the entire match
   in fullmatch.  If the 'g' option is given, returns all matches in the
   string.  If the 'r' option is given, also returns the text before and
   after the match in prematch and postmatch respectively.

See the regression tests for more details about usage and return values.
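
For readers unfamiliar with the idea behind regexp_split, here is a minimal
sketch in C of splitting a string on a regular expression.  It is not code
from regexp_ext: it uses the POSIX <regex.h> API (regcomp/regexec) rather
than the backend's regex engine, and split_on_regex, PIECE_LEN and the
fixed-size parts buffer are hypothetical names chosen for illustration.
How the module itself treats empty matches is not stated above; the sketch
simply stops splitting when it sees one.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define PIECE_LEN 64            /* illustrative limit: 63 chars per piece */

/*
 * Hypothetical helper, not part of regexp_ext: split str on each match of
 * pattern and copy up to max_parts pieces into parts[].  Pieces longer than
 * PIECE_LEN - 1 characters are truncated.  Returns the number of pieces, or
 * -1 if the pattern does not compile.
 */
static int
split_on_regex(const char *str, const char *pattern,
               char parts[][PIECE_LEN], int max_parts)
{
    regex_t     re;
    regmatch_t  m;
    const char *cur = str;
    int         n = 0;

    assert(max_parts >= 1);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Leave one slot free for the text after the last delimiter. */
    while (n < max_parts - 1)
    {
        /* Past the first search, cur is no longer the start of the line. */
        int         eflags = (cur == str) ? 0 : REG_NOTBOL;

        if (regexec(&re, cur, 1, &m, eflags) != 0)
            break;              /* no further delimiter */
        if (m.rm_eo == m.rm_so)
            break;              /* empty match: stop rather than loop forever */

        /* The piece is everything before the delimiter. */
        snprintf(parts[n++], PIECE_LEN, "%.*s", (int) m.rm_so, cur);
        cur += m.rm_eo;         /* resume just after the delimiter */
    }

    /* Whatever is left after the last delimiter is the final piece. */
    snprintf(parts[n++], PIECE_LEN, "%s", cur);

    regfree(&re);
    return n;
}
```

Called with the string "a, b,c" and the pattern ", *", the helper fills
parts[] with three pieces.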

Recent changes:
* I have put the pattern after the string in all of the functions, as
  discussed on the pgsql-hackers mailing list.

* regexp flags (a la regexp_replace).

* make regexp_matches return setof whatever, if given a 'g' flag return
  all matches in string.
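
The 'g' behaviour amounts to re-running the search from just past the
previous match until nothing more is found.  The sketch below shows that
loop in C under some assumptions: it uses POSIX regcomp/regexec as a
stand-in for the backend engine, and find_all_matches is a hypothetical
name, not a function in this module.  The patch to core in this message
passes a start_search offset to the engine for this purpose; POSIX regexec
has no such parameter, so the sketch advances the string pointer instead.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * Hypothetical helper, not part of regexp_ext: record the byte offsets of
 * up to max_out non-overlapping matches of pattern in str.  Offsets in
 * out[] are relative to the start of str.  Returns the number of matches,
 * or -1 if the pattern does not compile.
 */
static int
find_all_matches(const char *str, const char *pattern,
                 regmatch_t *out, int max_out)
{
    regex_t     re;
    regoff_t    offset = 0;
    regoff_t    len;
    int         n = 0;

    assert(str != NULL && pattern != NULL && out != NULL);
    len = (regoff_t) strlen(str);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (n < max_out && offset <= len)
    {
        regmatch_t  m;
        /* After the first search we are no longer at the start of line. */
        int         eflags = (offset == 0) ? 0 : REG_NOTBOL;

        if (regexec(&re, str + offset, 1, &m, eflags) != 0)
            break;              /* no more matches */

        /* Translate offsets back to positions in the whole string. */
        out[n].rm_so = offset + m.rm_so;
        out[n].rm_eo = offset + m.rm_eo;
        n++;

        /* Resume after the match; step one byte past an empty match. */
        offset += (m.rm_eo > m.rm_so) ? m.rm_eo : m.rm_eo + 1;
    }

    regfree(&re);
    return n;
}
```

Note that this works on bytes; the backend converts to pg_wchar first, so
its offsets are character positions rather than byte positions.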

Things that I still want to look into:
* maybe a join function that works as an aggregate
   SELECT join(',', col) FROM tbl
  currently can be written as
   SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')


-- 
Philogeny recapitulates erogeny; erogeny recapitulates philogeny.

Index: src/backend/utils/adt/regexp.c
===================================================================
RCS file: /home/jeremyd/local/postgres/cvsuproot/pgsql/src/backend/utils/adt/regexp.c,v
retrieving revision 1.68
diff -c -r1.68 regexp.c
*** src/backend/utils/adt/regexp.c      5 Jan 2007 22:19:41 -0000       1.68
--- src/backend/utils/adt/regexp.c      4 Feb 2007 07:58:26 -0000
***************
*** 29,41 ****
   */
  #include "postgres.h"
  
- #include "regex/regex.h"
  #include "utils/builtins.h"
  #include "utils/guc.h"
  
  
  /* GUC-settable flavor parameter */
! static int    regex_flavor = REG_ADVANCED;
  
  
  /*
--- 29,41 ----
   */
  #include "postgres.h"
  
  #include "utils/builtins.h"
  #include "utils/guc.h"
+ #include "utils/regexp.h"
  
  
  /* GUC-settable flavor parameter */
! int   regex_flavor = REG_ADVANCED;
  
  
  /*
***************
*** 90,96 ****
   * Pattern is given in the database encoding.  We internally convert to
   * array of pg_wchar which is what Spencer's regex package wants.
   */
! static regex_t *
  RE_compile_and_cache(text *text_re, int cflags)
  {
        int                     text_re_len = VARSIZE(text_re);
--- 90,96 ----
   * Pattern is given in the database encoding.  We internally convert to
   * array of pg_wchar which is what Spencer's regex package wants.
   */
! regex_t *
  RE_compile_and_cache(text *text_re, int cflags)
  {
        int                     text_re_len = VARSIZE(text_re);
***************
*** 191,238 ****
  }
  
  /*
!  * RE_compile_and_execute - compile and execute a RE
   *
   * Returns TRUE on match, FALSE on no match
   *
!  *    text_re --- the pattern, expressed as an *untoasted* TEXT object
!  *    dat --- the data to match against (need not be null-terminated)
!  *    dat_len --- the length of the data string
!  *    cflags --- compile options for the pattern
   *    nmatch, pmatch  --- optional return area for match details
   *
!  * Both pattern and data are given in the database encoding.  We internally
!  * convert to array of pg_wchar which is what Spencer's regex package wants.
   */
! static bool
! RE_compile_and_execute(text *text_re, char *dat, int dat_len,
!                                          int cflags, int nmatch, regmatch_t *pmatch)
  {
-       pg_wchar   *data;
-       size_t          data_len;
        int                     regexec_result;
-       regex_t    *re;
        char            errMsg[100];
  
-       /* Convert data string to wide characters */
-       data = (pg_wchar *) palloc((dat_len + 1) * sizeof(pg_wchar));
-       data_len = pg_mb2wchar_with_len(dat, data, dat_len);
- 
-       /* Compile RE */
-       re = RE_compile_and_cache(text_re, cflags);
- 
        /* Perform RE match and return result */
        regexec_result = pg_regexec(re,
                                                                data,
                                                                data_len,
!                                                               0,
                                                                NULL,   /* no details */
                                                                nmatch,
                                                                pmatch,
                                                                0);
  
-       pfree(data);
- 
        if (regexec_result != REG_OKAY && regexec_result != REG_NOMATCH)
        {
                /* re failed??? */
--- 191,226 ----
  }
  
  /*
!  * RE_wchar_execute - execute a RE
   *
   * Returns TRUE on match, FALSE on no match
   *
!  *    re --- the compiled pattern as returned by RE_compile_and_cache
!  *    data --- the data to match against (need not be null-terminated)
!  *    data_len --- the length of the data string
!  *    start_search -- the offset in the data to start searching
   *    nmatch, pmatch  --- optional return area for match details
   *
!  * Data is given as array of pg_wchar which is what Spencer's regex package
!  * wants.
   */
! bool
! RE_wchar_execute(regex_t *re, pg_wchar *data, int data_len, size_t start_search,
!                                          int nmatch, regmatch_t *pmatch)
  {
        int                     regexec_result;
        char            errMsg[100];
  
        /* Perform RE match and return result */
        regexec_result = pg_regexec(re,
                                                                data,
                                                                data_len,
!                                                               start_search,
                                                                NULL,   /* no details */
                                                                nmatch,
                                                                pmatch,
                                                                0);
  
        if (regexec_result != REG_OKAY && regexec_result != REG_NOMATCH)
        {
                /* re failed??? */
***************
*** 245,250 ****
--- 233,295 ----
        return (regexec_result == REG_OKAY);
  }
  
+ /*
+  * RE_execute - execute a RE
+  *
+  * Returns TRUE on match, FALSE on no match
+  *
+  *    re --- the compiled pattern as returned by RE_compile_and_cache
+  *    dat --- the data to match against (need not be null-terminated)
+  *    dat_len --- the length of the data string
+  *    nmatch, pmatch  --- optional return area for match details
+  *
+  * Data is given in the database encoding.  We internally
+  * convert to array of pg_wchar which is what Spencer's regex package wants.
+  */
+ bool
+ RE_execute(regex_t *re, char *dat, int dat_len,
+                                          int nmatch, regmatch_t *pmatch)
+ {
+       pg_wchar   *data;
+       size_t          data_len;
+       bool            match;
+ 
+       /* Convert data string to wide characters */
+       data = (pg_wchar *) palloc((dat_len + 1) * sizeof(pg_wchar));
+       data_len = pg_mb2wchar_with_len(dat, data, dat_len);
+ 
+       /* Perform RE match and return result */
+       match = RE_wchar_execute(re, data, data_len, 0, nmatch, pmatch);
+       pfree(data);
+       return match;
+ }
+ 
+ /*
+  * RE_compile_and_execute - compile and execute a RE
+  *
+  * Returns TRUE on match, FALSE on no match
+  *
+  *    text_re --- the pattern, expressed as an *untoasted* TEXT object
+  *    dat --- the data to match against (need not be null-terminated)
+  *    dat_len --- the length of the data string
+  *    cflags --- compile options for the pattern
+  *    nmatch, pmatch  --- optional return area for match details
+  *
+  * Both pattern and data are given in the database encoding.  We internally
+  * convert to array of pg_wchar which is what Spencer's regex package wants.
+  */
+ bool
+ RE_compile_and_execute(text *text_re, char *dat, int dat_len,
+                                          int cflags, int nmatch, regmatch_t *pmatch)
+ {
+       regex_t    *re;
+ 
+       /* Compile RE */
+       re = RE_compile_and_cache(text_re, cflags);
+ 
+       return RE_execute(re, dat, dat_len, nmatch, pmatch);
+ }
+ 
  
  /*
   * assign_regex_flavor - GUC hook to validate and set REGEX_FLAVOR
Index: src/backend/utils/adt/varlena.c
===================================================================
RCS file: /home/jeremyd/local/postgres/cvsuproot/pgsql/src/backend/utils/adt/varlena.c,v
retrieving revision 1.154
diff -c -r1.154 varlena.c
*** src/backend/utils/adt/varlena.c     5 Jan 2007 22:19:42 -0000       1.154
--- src/backend/utils/adt/varlena.c     2 Feb 2007 02:50:31 -0000
***************
*** 23,32 ****
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
  #include "parser/scansup.h"
- #include "regex/regex.h"
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/pg_locale.h"
  
  
  typedef struct varlena unknown;
--- 23,32 ----
  #include "libpq/pqformat.h"
  #include "miscadmin.h"
  #include "parser/scansup.h"
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/pg_locale.h"
+ #include "utils/regexp.h"
  
  
  typedef struct varlena unknown;
***************
*** 2355,2386 ****
        search_start = 0;
        while (search_start <= data_len)
        {
-               int                     regexec_result;
- 
                CHECK_FOR_INTERRUPTS();
  
!               regexec_result = pg_regexec(re,
!                                                                       data,
!                                                                       data_len,
!                                                                       search_start,
!                                                                       NULL,   /* no details */
!                                                                       REGEXP_REPLACE_BACKREF_CNT,
!                                                                       pmatch,
!                                                                       0);
! 
!               if (regexec_result == REG_NOMATCH)
                        break;
  
-               if (regexec_result != REG_OKAY)
-               {
-                       char            errMsg[100];
- 
-                       pg_regerror(regexec_result, re, errMsg, sizeof(errMsg));
-                       ereport(ERROR,
-                                       (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
-                                        errmsg("regular expression failed: %s", errMsg)));
-               }
- 
                /*
                 * Copy the text to the left of the match position.  Note we are
                 * given character not byte indexes.
--- 2355,2366 ----
        search_start = 0;
        while (search_start <= data_len)
        {
                CHECK_FOR_INTERRUPTS();
  
!               if (!RE_wchar_execute (re, data, data_len, search_start,
!                                                       REGEXP_REPLACE_BACKREF_CNT, pmatch))
                        break;
  
                /*
                 * Copy the text to the left of the match position.  Note we are
                 * given character not byte indexes.
*** ../pgsql-orig/src/include/utils/regexp.h    Wed Dec 31 16:00:00 1969
--- src/include/utils/regexp.h  Thu Feb  1 18:46:49 2007
***************
*** 0 ****
--- 1,29 ----
+ /*-------------------------------------------------------------------------
+  *
+  * regexp.h
+  *      Header file for regexp connector code.
+  *
+  * Copyright (c) 2007, PostgreSQL Global Development Group
+  *
+  * $PostgreSQL$
+  *
+  *-------------------------------------------------------------------------
+  */ 
+ #ifndef REGEXP_H
+ #define REGEXP_H
+ 
+ #include "regex/regex.h"
+ 
+ /* regexp support routines for PostgreSQL-izing regexp code */
+ extern regex_t * RE_compile_and_cache(text *text_re, int cflags);
+ extern bool RE_compile_and_execute(text *text_re, char *dat, int dat_len,
+                                          int cflags, int nmatch, regmatch_t *pmatch);
+ extern bool RE_wchar_execute(regex_t *re, pg_wchar *data, int data_len,
+                                          size_t start_search, int nmatch, regmatch_t *pmatch);
+ extern bool RE_execute(regex_t *re, char *dat, int dat_len,
+                                          int nmatch, regmatch_t *pmatch);
+ 
+ /* regexp flavor GUC variable */
+ extern int regex_flavor;
+ 
+ #endif   /* REGEXP_H */

regexp_ext.tar.gz
Description: Binary data

