BTW, I created a rough sketch of a strawman here: https://github.com/isiahmeadows/streaming-regexp-proposal -----
Isiah Meadows [email protected] www.isiahmeadows.com On Mon, Jul 30, 2018 at 5:48 PM, Isiah Meadows <[email protected]> wrote: > I was thinking in terms of what hooks need overridden (none for the > base proposal), not where the method itself lived. > ----- > > Isiah Meadows > [email protected] > www.isiahmeadows.com > > > On Mon, Jul 30, 2018 at 5:44 PM, Jordan Harband <[email protected]> wrote: >> (It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the >> right thing for your alternative regex) >> >> On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <[email protected]> >> wrote: >>> >>> Not yet, but if I were, it wouldn't be a subclass (it's not >>> necessary). But the key trouble implementing this is that I'd have to >>> reimplement the regexp matching logic entirely from scratch if I want >>> remotely reasonable runtime complexity. As for why: >>> >>> - If you ignore backreferences (which IIUC makes JS regexps go from >>> regular to context-sensitive recognizers), it's *possible* to >>> implement partial matching using regexp rewriting, but only because JS >>> doesn't have any other arcane enough features to make it infeasible. >>> - There is no possible way to extract duplicate matches without >>> rewriting the regexp entirely and using `regexp.exec`, and the >>> complexity of the logic to do this generally I suspect is NP-complete, >>> maybe PSPACE-complete, and in either case, certainly infeasible when >>> backreferences enter the picture. >>> - It's technically possible to refactor a string for streaming, but I >>> then lose the ability to discern a match from a non-match. I also have >>> the same issue as per above WRT duplicate matches and partial matching >>> with complexity issues. >>> - If you were to combine the three, rewriting the regexp generally to >>> support streaming matches, including duplicates, I suspect would >>> likely be PSPACE-complete because it seems *very* similar to the first >>> problem listed here [1]. >>> >>> \* Backreferences bring the grammatical complexity of JS regexps from >>> I'm pretty sure regular to context-sensitive. >>> >>> Or in other words, either you control the raw matching logic yourself, >>> or the polyfill runtime complexity could get absurd for this proposal. >>> And implementing a pushdown state machine-based regexp engine + parser >>> in JS isn't exactly something I'm willing to prototype for a strawman. >>> >>> [1]: >>> https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata >>> >>> ----- >>> >>> Isiah Meadows >>> [email protected] >>> www.isiahmeadows.com >>> >>> >>> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <[email protected]> wrote: >>> > Have you tried to implement this as a RegExp subclass, overriding all >>> > the >>> > necessary extension points? >>> > >>> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <[email protected]> >>> > wrote: >>> >> >>> >> There's two things I've found that need suspendable matching: >>> >> >>> >> 1. Partially matching against a string, which is useful with >>> >> interactive form validation and such. >>> >> 2. Pattern matching and replacement over a character stream, which is >>> >> useful for things like matching against files without loading the >>> >> entire thing into memory or easier filtering of requests. >>> >> >>> >> Also, it'd be nice if there was a facility to get *all* matches, >>> >> including duplicate group matches. This is often useful for simple >>> >> parsing, where if such support existed, you could just use a Kleene >>> >> star instead of the standard `exec` loops (which admittedly get old). >>> >> >>> >> And finally, we could avoid setting regexp globals here. That would >>> >> speed up the matcher quite a bit. >>> >> >>> >> So, here's my proposal: >>> >> >>> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. >>> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` - >>> >> Consume a Unicode code point or `-1` if no more characters exist, and >>> >> return a match result, `undefined` if no match occurred. `charSize` is >>> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u` >>> >> is set, 1 otherwise), so it can work with other encodings flexibly. >>> >> - `matcher.nextPossibleStart -> number` - The next possible start the >>> >> matcher could have, for more effective buffering and stream >>> >> management. This is implementation-defined, but it *must* be be `-1` >>> >> after the matcher completes, and it *must* be within [0, N) otherwise, >>> >> where N is the next returned match. >>> >> - `result.group -> string | number | undefined` - Return the group >>> >> index/name of the current match, or `undefined` if it's just issuing a >>> >> match of the global regexp. >>> >> - `result.start -> number` - Return the matched value's start index. >>> >> - `result.end -> number` - Return the matched value's end index. >>> >> - This does *not* modify any globals or regexp instance members. It >>> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on >>> >> strings, so it shouldn't return any it doesn't already have.) >>> >> >>> >> Most RegExp methods could similarly be built using this as a base: if >>> >> they work on strings, they can iterate their code points. >>> >> >>> >> As for the various concerns: >>> >> >>> >> - Partial matching is just iterating a string's character codes and >>> >> seeing if the matcher ever returned non-`undefined`. >>> >> - Streaming pattern matching is pretty obvious from just reading the >>> >> API. >>> >> - Getting all matches is just iterating the string and returning an >>> >> object with all the groups + strings it matched. >>> >> >>> >> So WDYT? >>> >> >>> >> /cc Mathias Bynens, since I know you're involved in this kind of >>> >> text-heavy stuff. >>> >> >>> >> ----- >>> >> >>> >> Isiah Meadows >>> >> [email protected] >>> >> www.isiahmeadows.com >>> >> _______________________________________________ >>> >> es-discuss mailing list >>> >> [email protected] >>> >> https://mail.mozilla.org/listinfo/es-discuss >>> > >>> > >> >> _______________________________________________ es-discuss mailing list [email protected] https://mail.mozilla.org/listinfo/es-discuss

