Re: [Jbeta] bug in regexp support distributed with J

Oleg Kobchenko Fri, 15 Jan 2010 23:34:03 -0800

I am not sure, but an "overlapping" variant of rxmatches could be
achieved by resuming the search one character past the start of
the last match, as opposed to past its end.
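
To make the idea concrete, here is a minimal sketch in Python rather than J. It is an illustration only, not the rxmatches code: the helper name find_matches and its overlapping flag are hypothetical, and it relies on the start-position argument of Python's re search method.

```python
import re

def find_matches(pattern, text, overlapping=False):
    """Return (start, end) spans of successive matches of pattern in text.

    Hypothetical helper for illustration; not part of J or PCRE.
    """
    rx = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(text):
        m = rx.search(text, pos)   # search begins at offset pos, no copying
        if m is None:
            break
        spans.append(m.span())
        if overlapping:
            # resume one character after the START of this match
            pos = m.start() + 1
        else:
            # resume after the END of this match; step over empty matches
            pos = m.end() if m.end() > m.start() else m.end() + 1
    return spans

print(find_matches("aba", "ababa"))                    # [(0, 3)]
print(find_matches("aba", "ababa", overlapping=True))  # [(0, 3), (2, 5)]
```

The only difference between the two cases is where the next search begins, which is why an offset parameter would serve both.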


However, looking at the code of rxmatches, we see that in order
to proceed to the remainder of the input, it applies }. (Drop)
over and over. I don't know whether J optimizes this to avoid
copying the data, but in any case a better way is to supply
an offset instead. This is possible with the native PCRE API:

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

However, the POSIX API, which J exposes from the jpcre DLL,
does not allow an offset:

       int regexec(regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

So maybe it makes sense to switch to the native PCRE API in the
jpcre DLL and enhance the corresponding verbs to accept an offset,
which would improve the performance of rxmatches in both cases
(overlapping and non-overlapping)?
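
Besides performance, searching from an offset and searching a dropped copy are not always equivalent. The sketch below shows this with Python's re module, used here only as an analogy; I have not checked how the J verbs or the jpcre DLL would behave, so treat the parallel as an assumption.

```python
import re

rx = re.compile(r"^b")
text = "ab"

# Search the original string starting at offset 1: "^" still refers
# to the real beginning of the string, so nothing matches here.
by_offset = rx.search(text, 1)

# Search a copy with the first character dropped: the copy now
# begins with "b", so "^b" matches at position 0 of the copy.
by_slice = rx.search(text[1:])

print(by_offset)         # None
print(by_slice.span())   # (0, 1)
```

So an offset-aware verb would need to document which of the two behaviors it provides for anchored patterns.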



> From: Oleg Kobchenko
> 
> > From: Raul Miller 
> > 
> > On Fri, Jan 15, 2010 at 3:26 AM, Oleg Kobchenko wrote:
> > >> From: Raul Miller 
> > >> On Thu, Jan 14, 2010 at 6:39 PM, Oleg Kobchenko wrote:
> > >> > Does it behave differently in Perl?
> > >>
> > >> Perl finds non-overlapping matches by default, but
> > >> lets you restart the match at any given position so
> > >> you can easily implement the overlapping matches
> > >> case.
> > >
> > > What does "restart the match" mean?
> > >
> > > Maybe you should do the same in J?
> > 
> > When matching in perl, you can have the
> > regexp start at a specific index.  Since you
> > know where the previous match began,
> > you can start again at the following character.
> > 
> > To do this in J would require forming an
> > explicit loop using rxmatch instead of
> > rxmatches and extracting the appropriate
> > substrings, all of which would be orders
> > of magnitude slower than the perl approach.
> > 
> > >> > It looks like non-overlapping makes more sense.
> > >>
> > >> Both have uses.
> > >
> > > In any case it looks like not a bug.
> > 
> > So I posted this to the wrong list?
> 
> "Not a bug" means seeking a feature not intended
> by design.
> 
> > But if this is not a bug in rxmatches, it is
> > then a bug in the documentation for
> > rxmatches, since rxmatches does not
> > actually return "all matches".
> 
> We need to distinguish a Match from a regex Group
> (a parenthesized and optionally named sub-match).
> Groups can be nested (wholly contained in the outer group),
> but they cannot partially overlap.
> 
> Note: in the regex
>   (ab)|(cd)
> the parens are redundant (unless you want to signal
> which alternative triggered). Using "|" makes the two
> (or more) parts mutually exclusive (disjunctive), i.e.
> either one whole sub-expression or the other is matched.
> 
> What rxmatches does is find one match after another:
> when a match is found, it consumes the input, and
> the next match starts after the last consumed character.
> So it is by definition non-overlapping.
> 
> I am not even familiar with the concept of "overlapping"
> matches. In general-purpose parsing, each character can
> belong to at most one match.
> 
> Are there any theory, references, or code samples for
> "overlapped" matches?
> 


      
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
