On 17 April 2014 15:56, Jonathan S. Shapiro <[email protected]> wrote: > I have a confession to make: I'm an idiot.
There is way too much clever going around, especially in language and UI design. > The ability to match multiple REs simultaneously (in parallel) and return a > preset ID indicating which one matched. In effect: to return a token ID in > addition to the string matched. This is a "generic" capability. I have been thinking about this thing, too: https://bitbucket.org/william_ml_leslie/match/src/tip/aurochs/test/test_tokeniser.py?at=default I figured I could just maintain the order that the NFA nodes were visited. For this tokeniser, this is definition order - so if you define "do" before "[a-zA-Z_][a-zA-Z0-9_]*", it will match the "do" token first, so that is the token "id" it emits. When desugaring from a grammar in BNF form, I could arrange for the tokeniser to consider the left- or right-recursive definitions and attempt to work along with the lexer. This doesn't compile down to C or equivalent yet (I also didn't bother with bytecode - I don't think tokenising needs to take into account mixfix, I would rather handle it in the lexer) but I just wanted to show you how I thought this problem might be tackled. > Code Points vs UTF-8 Strings > > If you match code points, you soon conclude that you need a way to define > sparse sets of transitions. The underlying problem is that most of the > Unicode character classes are not densely encoded. A naively encoded set of > the form { (codepoint, state *) ... } requires prohibitive space. So the > first thing you do is try to come up with a clever notion of a UCS range set > and a UCS range map that maintains dense encodings where possible. That > isn't as simple as it sounds. In fact, it's frought with opportunities to go > wrong. I think I still disagree - a precomputed byte-array that can represent eight properties that you might test is only about 12 kB, and once you have a codepoint, you can simply index and mask. There is one other thing you need to keep track of when attempting to match several regex at once: You need to check existing DFA branches because those that match a specific character and those that match something like a class can overlap. But you already needed to do most of that work anyway. I guess that's something to try some other day, once I have this compiling down to C. -- William Leslie Notice: Likely much of this email is, by the nature of copyright, covered under copyright law. You absolutely may reproduce any part of it in accordance with the copyright law of the nation you are reading this in. Any attempt to deny you those rights would be illegal without prior contractual agreement. _______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
