Re: GSoC Regexp engine

Nikolai Weibull Thu, 31 May 2007 09:55:48 -0700

On 5/31/07, Ian Young <[EMAIL PROTECTED]> wrote:

I'm Ian, one of the two students working on improving the regexp
engine in Vim for this year's Google Summer of Code.  I haven't had a
whole lot to contribute as of yet, but now that work is underway, I'll
probably pop up here asking lots of questions some days.

Right now we're working on getting things set up and building a
testing suite, but I thought I would spark some discussion on a design
decision that will be coming up after we finish this phase, which is
whether to implement the new model ourselves, or use an alternative
engine, like TRE: <http://laurikari.net/tre/>. I'm tempted to
implement one ourselves, as it's an intellectually stimulating
prospect, but that doesn't mean I won't listen to reason if TRE or
another option is far better. I don't know much about the internals of
TRE, but according to previous posts to this list, it utilizes three
engines: a slow one for handling backreferences (presumably similar to
Vim's current engine), a fast one for most cases (what we are looking
to implement), and one for their 'fuzzy matching' feature.

I have a couple questions to start things off. First: I couldn't see
much need for 'fuzzy matching' in Vim, but some of you are probably
much better acquainted with regexp use cases than I am.  Would this be
a useful feature to have available?

Second: We might have to do some
gymnastics to work with multibyte characters, as discussed here: <
http://tech.groups.yahoo.com/group/vimdev/message/46408>. I haven't
worked with multibyte characters before, so I'm not clear on the
subtleties.  Would this translation to wide characters before passing
to the engine cause much of a performance hit and/or be excessively
complicated to implement? On a side note, TRE's main page says it has
both wide character and multibyte character support. I couldn't find a
version history, so I'm not sure if this is a new feature that Nikolai
isn't aware of, or if we need something more.


It supports

* Byte matching, that is, raw bytes
* Wide characters, that is, whatever wchar_t is
* Multi-byte characters, thas is, whatever mbrtowc supports
* Streams that is, objects that feed TRE characters as it needs them

It would be pretty easy to set up a stream object that would feed TRE
characters.  It would only have to keep track of where it was in the
buffer and basically request more of the buffer as TRE needs it.

It should be noted that there are quite a few bugs in TRE that relate
to the interaction of quantifiers.  I have discussed this privately
with Ville, but neither of us has been able to resolve it.  It has
also been discussed here:

http://laurikari.net/pipermail/tre-general/2007-February/thread.html

where Chris Kuklewicz suggests a solution to the problem that seems to
work.  It is a somewhat costly solution, but it may be worth it in all
its simplicity.  Chris has written an implementation of TDFAs for
Haskell that is quite simple and manages to both outperform all other
regex libraries for Haskell and still pass all POSIX tests.  Here's
the announcement:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg11442.html

This will, sadly, be of no use to us, but it does show that TDFAs are
a possibility, and that the problems TRE has with quantifiers can be
resolved.

Anyway, fuzzy matching, it seems like this is a feature that never
really caught on.  Agrep has long enjoyed the status of being one of
the few commands that remain to be implemented for the GNU project
(can't seem to find the list right now, so I can't provide a link).
This does, however, seem to indicate that no one has cared enough
about it to implement and distribute it with GNU.  It can be a quite
interesting thing to have, but it's perhaps not useful enough to care
about at this stage.

Also, you won't have time to implement this yourself.  Seriously.  It
takes a lot of work to write an efficient and
as-compatible-as-possible implementation implementation and a summer
isn't nearly enough time to complete said work.  I think that what's
most important here is to set up a test suite and the code required to
interface with a library, such as TRE.  That way one can always hook
in another library when it gets written.

Finally, good to hear from you. I think we all look forward to being
able to enjoy the fruits of your hard labor ;-).

 nikolai

Re: GSoC Regexp engine

Reply via email to