Re: [bitc-dev] Unicode RegExp Hell

Jonathan S. Shapiro Wed, 16 Apr 2014 23:06:22 -0700

Addenda:

1. I'm now assuming that the goal is to process UTF-8 encoded input. I
failed to say so in the previous post, but given that input files are
specified as UTF-8, it seems irredeemably silly to first expand them to
UCS2 and then contract them for regular expression purposes. In short, both
Java and C# have input processing on text files hopelessly borked.
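
To make point 1 concrete, here is a minimal Python sketch of matching a single code point directly against UTF-8 bytes, with no intermediate expansion to 16-bit units. The helper names (utf8_steps, match_code_point) are hypothetical and not part of any BitC design; a real matcher would compile whole character classes into byte-level transitions rather than one code point at a time.

```python
def utf8_steps(code_point: int) -> bytes:
    """Return the UTF-8 byte sequence a byte-level matcher must see for one code point."""
    return chr(code_point).encode("utf-8")


def match_code_point(data: bytes, pos: int, code_point: int) -> int:
    """Match one code point at byte offset pos.

    Returns the byte offset just past the match, or -1 if the bytes differ.
    The input is never decoded; only raw bytes are compared.
    """
    steps = utf8_steps(code_point)
    end = pos + len(steps)
    return end if data[pos:end] == steps else -1


# The input stays UTF-8 from file to matcher.
source = "café".encode("utf-8")
after = match_code_point(source, 3, 0xE9)  # U+00E9 occupies two bytes at offset 3
```

Here after is 5: the three ASCII bytes are followed by the two-byte encoding of U+00E9, and the matcher advances by bytes rather than by characters.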


2. The main issue to decide in the debate between ONMATCH and CHAR c /pc/
would appear to be constraints on rewriting. I'm perfectly comfortable with
declaring that there are bracketing constraints on bytecode, e.g. that an
opening ONMATCH must be closed by a matching FAIL. I'm also comfortable with saying
that the "scope" of an ONMATCH is lexical, in the sense that a JMP
instruction that exits the ONMATCH/FAIL pair has the side effect of setting
the contextually prevailing match to "undefined", with the implication that
a successfully matching CHAR instruction results in an exceptional outcome.
Both constraints can reasonably be maintained by an NFA->DFA converter.
Finally, I'm comfortable declaring that (as a well-formedness constraint)
ONMATCH/FAIL brackets may not nest.
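
The constraints in point 2 can be sketched as a small well-formedness pass in Python. Everything about the encoding here is an assumption made for illustration: instructions are (opcode, operand) tuples, a JMP operand is a target instruction index, and a FAIL seen with no ONMATCH open is treated as an ordinary instruction. The function names are hypothetical. The pass rejects nested or unclosed brackets, and separately lists the JMPs that leave a bracket, i.e. the ones that would set the prevailing match to "undefined".

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: (opcode, operand); a JMP operand is a target index.
Instr = Tuple[str, Optional[int]]
Span = Tuple[int, int]


def check_brackets(code: List[Instr]) -> List[Span]:
    """Return (onmatch_index, fail_index) pairs, or raise ValueError if ill-formed."""
    spans: List[Span] = []
    open_at: Optional[int] = None
    for i, (op, _) in enumerate(code):
        if op == "ONMATCH":
            if open_at is not None:
                # Brackets may not nest.
                raise ValueError(f"nested ONMATCH at {i} (open since {open_at})")
            open_at = i
        elif op == "FAIL" and open_at is not None:
            # This FAIL closes the currently open bracket.
            spans.append((open_at, i))
            open_at = None
    if open_at is not None:
        raise ValueError(f"ONMATCH at {open_at} has no closing FAIL")
    return spans


def exiting_jumps(code: List[Instr], spans: List[Span]) -> List[int]:
    """Indices of JMPs inside a bracket whose target lies outside that bracket."""
    exits: List[int] = []
    for start, end in spans:
        for i in range(start + 1, end):
            op, target = code[i]
            if op == "JMP" and target is not None and not (start <= target <= end):
                exits.append(i)
    return exits


program: List[Instr] = [
    ("ONMATCH", None),    # 0: opens the bracket
    ("CHAR", ord("a")),   # 1
    ("JMP", 4),           # 2: leaves the bracket
    ("FAIL", None),       # 3: closes the bracket
    ("CHAR", ord("b")),   # 4: prevailing match is undefined if reached via the JMP
]
spans = check_brackets(program)
exits = exiting_jumps(program, spans)
```

For this program the pass finds one bracket spanning instructions 0 through 3 and flags the JMP at index 2 as exiting it. An NFA->DFA converter could run a check of this shape on its output to confirm that it has preserved both constraints.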


shap

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
