Recent discussions have highlighted some interesting issues. There's an aim to sync glibc with GNU regex. (Great! Any help needed? Or is it just slow because syncing with glibc is a slow process?) As a result, there seems to be a desire not to change the API or ABI further. As far as syncing with glibc goes, this is obviously a good thing.

However, there are deficiencies in the GNU API. For example:

1. Lack of a thread-safe API to set syntax and various other options.

2. Translation tables are 8-bit only (and don't make much sense in a wide-character context, at least in their present form, as you can't sensibly have 4GB translation tables; the most common use is taken care of by RE_ICASE).

3. Missing functionality (no way to do plain text searching, e.g. via the RE_PLAIN flag that I proposed, or Bruno's regexp-quote).

4. Inefficiencies (e.g. for certain cases of matching wide-character strings, especially in the UTF-8 encoding, as discussed on bug-grep, and represented in the code by e.g. the patch that converts UTF-8 back to ASCII where possible). I mention this, although it is not an API/ABI issue, because it can nonetheless involve large code changes which are hard to get into glibc.

5. Special-case APIs that do not obviously have a place in a general-purpose library (I'm thinking of the split-buffer functions, which are only really useful to interactive editors, and then only to editors implemented in a particular way). Emacs uses them; does anything else?

6. One library, two APIs (GNU and POSIX); some functionality is only available via one API, some via the other. The majority of application code uses the POSIX API; the majority of extra functionality is in the GNU API.

This should be rectified. I'd like to see a single API, which is a backwards-compatible (API & ABI) extension of the POSIX API, thereby providing facilities like backwards searching to applications that use the POSIX API, and plain-text searching to all. The GNU API as a whole could be deprecated, and maintained only for old code (the fact that it is not currently even documented in glibc suggests that the glibc maintainers would go along with this).

Hence, it seems to me there's a case for splitting the two efforts: on the one hand, to get recent improvements into glibc (and indeed, this should be an ongoing process), and on the other, to make further improvements.

One way to do this would be a friendly fork of GNU regex, bearing the same relation to the gnulib version as eglibc does to glibc. But I think that these efforts can be reconciled, and that the right place to do so is in gnulib (which is already a home for general-purpose non-system code that is not in glibc, e.g. the various hash and list APIs).

There are two complications: first, the desire to sync glibc with a stable version of regex (this could be overcome by marking, e.g. with #ifdef, code that is not currently synced), and secondly, the desire to retain a stable version of regex for non-GNU systems, and for those with an out-of-date glibc (this can be overcome by keeping all changes API and ABI backwards-compatible).

Three different sets of programs benefit from this arrangement:

a. GNU programs that currently use the GNU API will have a natural incentive to move to the extended POSIX API, thereby making them more portable (as they will then be closer to compiling against a "pure" POSIX implementation, or to using other POSIX-compatible regex libraries such as PCRE).

b. GNU and non-GNU programs that already use the POSIX API can now have their capabilities extended easily.

c. Some things, like plain text searching, that are not currently possible with either API, will become available, thereby simplifying application code.

Further, GNU maintainers in particular benefit from a focus on a single API, which is being improved, rather than two which are stagnant.

In short, building and running old application code remains possible; maintaining current application code and writing new code becomes simpler, and all without having to use a different version of regex from gnulib's.

Finally, it is not obvious to me that it is desirable to add further features to glibc. glibc is a system library: it implements system APIs, and on a GNU system that means the POSIX and GNU regex APIs. The place for extensions is in a non-system library.

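As an aside on the plain-text searching point: here is a rough sketch of what an application has to do today with only the POSIX API, namely quote the search string itself before compiling it. The names regex_quote_bre and plain_search are made up for illustration (this is neither Bruno's regexp-quote nor the proposed RE_PLAIN flag), and the sketch assumes POSIX basic regular expressions and an encoding such as ASCII or UTF-8, in which the bytes of a multibyte character never coincide with the ASCII metacharacters.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of any existing API): return a malloc'd
   copy of TEXT in which each POSIX basic-regex metacharacter is preceded
   by a backslash, or NULL if allocation fails.  The caller frees it. */
char *regex_quote_bre(const char *text)
{
    assert(text != NULL);

    /* Worst case: every byte is escaped, plus the terminating NUL. */
    char *quoted = malloc(2 * strlen(text) + 1);
    if (quoted == NULL)
        return NULL;

    char *out = quoted;
    for (const char *in = text; *in != '\0'; in++) {
        /* Only these six characters are special outside a bracket
           expression in a basic regular expression. */
        if (strchr(".[\\*^$", *in) != NULL)
            *out++ = '\\';
        *out++ = *in;
    }
    *out = '\0';
    return quoted;
}

/* Hypothetical wrapper: return the byte offset of the first occurrence
   of NEEDLE in HAYSTACK, -1 if there is none, or -2 on error. */
long plain_search(const char *needle, const char *haystack)
{
    /* An empty needle trivially matches at offset 0; avoid compiling
       an empty pattern. */
    if (*needle == '\0')
        return 0;

    char *quoted = regex_quote_bre(needle);
    if (quoted == NULL)
        return -2;

    regex_t re;
    int rc = regcomp(&re, quoted, 0);   /* cflags 0: basic regex syntax */
    free(quoted);
    if (rc != 0)
        return -2;

    regmatch_t match;
    rc = regexec(&re, haystack, 1, &match, 0);
    regfree(&re);

    if (rc == REG_NOMATCH)
        return -1;
    if (rc != 0)
        return -2;
    return (long) match.rm_so;
}
```

With this, plain_search("a.c", "abc a.c") finds the literal "a.c" rather than matching "abc". Every application that wants this has to carry its own copy of such quoting code, and get the metacharacter set right for the syntax in use; a plain-text flag in the library would remove the need for it.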
Nonetheless, given fully backwards-compatible development, it should be possible to feed bug fixes and algorithmic improvements into glibc, without preventing further improvements to the API.

That was rather a long email, but the list of actions that results is small and simple:

1. Agree a method of managing syncing with glibc that allows for further development; implement an agreed way of keeping track of syncing.

2. Agree a policy on changes to gnulib's GNU regex (my conservative suggestion was "backwards ABI and API-compatible changes only").

3. In the light of this, consider various proposals for improving GNU regex (some of which I've made above).

-- 
http://rrt.sc3d.org
