Hi,

On Sun, Feb 28, 2021, at 03:13, David Fetter wrote:
> Maybe an int4multirange, which would fit unless I'm misunderstanding
> g's meaning with respect to non-overlapping patterns, but that might
> be a little too cute and not easy ever to extend.
> 
> Come to that, would a row structure that looked like
> 
>     (match, start, end)
> 
> be useful?

Nice, didn't know about the new multirange.
Its data structure seems like a perfect fit for this.

Hmm, I cannot find any catalog function to extract the ranges from the data 
structure though?
As a caller, I might need the exact start/end values,
not just wanting to know if a certain values overlaps any of the ranges.
Is there such a function?

Here is a PoC that just returns the start_pos and end_pos for all the matches.
It would be simple to modify it to instead return multirange.

CREATE OR REPLACE FUNCTION regexp_ranges(string text, pattern text, OUT 
start_pos integer, OUT end_pos integer)
RETURNS SETOF RECORD
LANGUAGE plpgsql
AS
$$
DECLARE
match text;
remainder text := string;
len integer;
BEGIN
end_pos := 0;
--
-- Ignore possible capture groups, instead just wrap the entire regex
-- in an enclosing capture group, which is then extracted as the first array 
element.
--
FOR match IN SELECT (regexp_matches(string,format('(%s)',pattern),'g'))[1] LOOP
  len := length(match);
  start_pos := position(match in remainder) + end_pos;
  end_pos := start_pos + len - 1;
  RETURN NEXT;
  remainder := right(remainder, -len);
END LOOP;
RETURN;
END
$$;

This works fine for small strings:

SELECT * FROM regexp_ranges('aaaa aa aaa','a+');
start_pos | end_pos
-----------+---------
         1 |       4
         6 |       7
        10 |      12
(3 rows)

Time: 0.209 ms

But quickly gets slow for longer strings:

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',10000),'a+');
20001
Time: 98.663 ms

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',20000),'a+');
40001
Time: 348.027 ms

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',30000),'a+');
60001
Time: 817.412 ms

SELECT COUNT(*) FROM regexp_ranges(repeat('aaaa aa aaa',40000),'a+');
80001
Time: 1478.438 ms (00:01.478)

Compared to the much nicer observed O-notation for regexp_matches():

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',10000),'(a+)','g');
20001
Time: 12.602 ms

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',20000),'(a+)','g');
40001
Time: 25.161 ms

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',30000),'(a+)','g');
60001
Time: 44.795 ms

SELECT COUNT(*) FROM regexp_matches(repeat('aaaa aa aaa',40000),'(a+)','g');
80001
Time: 57.292 ms

/Joel

Reply via email to