With regard to < https://github.com/mozilla/rust/issues/3591 >, I'd
like to write a regular expression module for Rust. I've written a
couple of regular expression engines in Python for fun in the past[*],
and #rust pressured me to utilize my perverse sense of fun to write
the same for Rust. Actually, the reason I learned Rust was to port a
bunch of my regex code to a nice language. :)

I'm writing this email because
https://github.com/mozilla/rust/wiki/Library-editing told me to. I
don't know much about the process, but as I understand it this marks
the beginning of a one-week discussion period in which Rust-Dev fleshes
out ideas for such a module and decides whether it deserves to be
written and whether I should be the one writing it.

I've also added this library page to the wiki:
https://github.com/mozilla/rust/wiki/Lib-re


I've already discussed this somewhat with some people in #rust; Marvin
Löbel (kimundi) in particular has been interested in helping me come
up with a nice API. Hopefully we can put that down in writing here so
that it isn't just in our memory.


Some questions to start off with:

- Should Rust have a new regex engine written in Rust, or should it
just have bindings for e.g. RE2 or similar?

    A point brought up in #rust: if we use RE2 or similar, we may not
    be able to have a re!() syntax extension that compiles regexps at
    compile time, along with the surrounding Rust code.

    I prefer the former, because I wanted to write a new regex engine
    regardless. I would be perfectly happy to write some nice bindings
    for something like RE2, but I am probably not the best person to do it.

- What syntax/semantics are important?

    I would propose supporting the "usual" PCRE syntax and semantics
    (including submatch extraction), with the exception of
    backreferences and any other features that cannot be implemented
    efficiently (i.e. in polynomial time).

    RE2 has a good summary of regex syntax, although for PCRE-family
    syntax it doesn't specify whether a given construct comes from
    Perl, libpcre, Python, or something else.

        http://code.google.com/p/re2/wiki/Syntax

    Note: if Rust's re module is efficient, syntax for things like
    possessive quantifiers is pointless and can be dropped.

    It may be desirable to include alternate parse disambiguation
    strategies. With an "efficient" RE engine, it's fairly easy to
    support POSIX-style longest match, as well as PCRE-style matching
    and even shortest match. For example, RE2 supports both PCRE-style
    and POSIX-style regex matching.
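
    To make the difference between these strategies concrete, here is
    a minimal sketch in Rust. It is not a regex engine: it only handles
    an alternation of literal strings anchored at the start of the
    text, and the names Disambiguation and match_alternation are made
    up for this example. It just shows how the three strategies can
    pick different matches from the same set of candidates.

```rust
/// How to choose among several alternatives that all match at the
/// same starting position. (Hypothetical names, for illustration only.)
#[derive(Clone, Copy)]
enum Disambiguation {
    /// PCRE-style: the first alternative in pattern order wins.
    LeftmostFirst,
    /// POSIX-style: the longest match wins.
    LeftmostLongest,
    /// The shortest match wins.
    Shortest,
}

/// Match an alternation of literal strings against the start of `text`
/// and return the byte length of the chosen match, if any.
fn match_alternation(alts: &[&str], text: &str, mode: Disambiguation) -> Option<usize> {
    // Lengths of every alternative that matches, in pattern order.
    let mut candidates = alts
        .iter()
        .filter(|a| text.starts_with(**a))
        .map(|a| a.len());
    match mode {
        Disambiguation::LeftmostFirst => candidates.next(),
        Disambiguation::LeftmostLongest => candidates.max(),
        Disambiguation::Shortest => candidates.min(),
    }
}
```

    With the alternatives a|ab against the text "abc", PCRE-style
    picks "a" (length 1), POSIX-style picks "ab" (length 2), and
    shortest match picks "a" again.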

- How important is Unicode support and how broad should that support be?

    My understanding is that, at least as long as it can be added
    later, this is not crucial to get right straight away.

    Unicode TR-18 defines 3 levels of Unicode support in regex
    implementations, of which only the first two are relevant. I think the
    only thing missing from core::unicode to give level 1 support is
    simple case folding.

    * https://github.com/mozilla/rust/issues/5820
    * http://www.unicode.org/reports/tr18/
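
    As a rough illustration of what simple case folding buys a regex
    engine, here is a sketch of caseless comparison in Rust. It is only
    an approximation: it uses the standard library's char::to_lowercase
    and keeps the result only when it is a single character, whereas
    real simple case folding would be driven by Unicode's
    CaseFolding.txt data (for example, this sketch does not fold final
    sigma to sigma). The names approx_simple_fold and caseless_eq are
    made up for this example.

```rust
/// Approximate simple (one-to-one) case folding for a single character.
/// Assumption: lowercasing stands in for folding; a real implementation
/// would use the C and S mappings from Unicode's CaseFolding.txt.
fn approx_simple_fold(c: char) -> char {
    let mut lower = c.to_lowercase();
    match (lower.next(), lower.next()) {
        // Exactly one character came back: use it.
        (Some(l), None) => l,
        // The mapping expands to several characters (or is empty):
        // a simple fold must stay one-to-one, so leave `c` unchanged.
        _ => c,
    }
}

/// Compare two strings character by character after folding each one.
fn caseless_eq(a: &str, b: &str) -> bool {
    a.chars()
        .map(approx_simple_fold)
        .eq(b.chars().map(approx_simple_fold))
}
```

    For example, caseless_eq("Regex", "REGEX") is true, and the Kelvin
    sign U+212A folds to a plain "k", so it would match an ASCII "k" in
    a case-insensitive pattern.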


That's probably enough to start off with, especially since the answer
to question 1 ties our hands on everything afterwards. Also, my
hands hurt from typing. However, there are a lot of other topics, like
what the API should look like, whether or not to support various
syntaxes, etc. I've added a lot of links and a few additional topics
to the library proposal page.

    link again: https://github.com/mozilla/rust/wiki/Lib-re

Let me know if there's something I've left out here or on the
library proposal page. When there's more discussion / I have more
energy, I will suggest some of my personal ideas for what I'd like in a
regex module, but somehow I don't feel that's appropriate in the
top-level post.

-- Devin Jeanpierre

.. [*] Here's the work I did before

    https://bitbucket.org/devin.jeanpierre/re0/
        an attempt at getting "everything"; it failed at the end since
        I couldn't do assertions in O(1) space

    https://bitbucket.org/devin.jeanpierre/replay/
        "CS" style regexps _without_ submatch extraction; this was me
        exploring lots of implementation strategies to get ideas for solving
        the above problem. Still not complete.
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev
