On 06-Aug-2015 11:26, MakersF wrote:
On Sunday, 2 August 2015 at 18:22:01 UTC, Jacob Carlborg wrote:
On 02/08/15 19:15, Xinok wrote:
I guess you're not familiar with the theoretical aspect of "formal
languages". The D grammar is a context-free grammar which cannot be
reduced to a regular expression. As cym13 stated, there are some simple
context-free grammars which can be rewritten as regular expressions, but
the D grammar cannot be. Take a look at the Chomsky Hierarchy [1] for a
better understanding.
The classic example of a context-free language is the set of balanced
parenthesis, i.e. (()) is balanced and ())))) is not. This language is
not regular meaning you cannot write a regular expression for it, but
you can write a context-free grammar for it.
TextMate grammars are not _just_ regular expressions. They can define
balanced parentheses [1].
The point of a language grammar in a text editor is not to have a 100%
correct implementation of the grammar. Rather it should syntax
highlight the code in a way that is useful for the user.
[1] https://manual.macromates.com/en/language_grammars
Then your best shot is to approximate the grammar with the regual
expressions you have access to. You'll get to a point where some
constructs can not be correctly represented; at that point you should
probably write a regex which produces what the grammar produces and some
more.
If one limits the depth of nested constructs to some reasonable value
(e.g. 5-6) then any context-free grammar is regular. In big grammars
combinatorial explosion may get quite high so that limit better be low.
If regular expressions are not hardcoded but rather themsleves are
generated then it's quite feasible to do. In fact Perl folks have been
doing this "approximate by regex" for years.
In the example before of generating paired interleaved parentheses, you
could generate every possible combination of parentheses, like
( (|)|[|]|{|}|" )*
where only the external parentheses are syntax for the regex. That regex
matches all the productions of the paired parentheses grammar, and many
more strings.
Here again a regex constructed to match e.g. 10-level deep expressions
is more then enough. Like this:
unittest
{
import std.regex;
string x = ""; // the first level is going to be ([^()]*\([^()]*\)?)*
foreach(_; 0..10)
x = `([^()]*\([^()]*`~ x ~`\)?)*`; // pump an extra level of
parenthesised expression
x = "^"~x~"$"; //make sure we match the whole string
assert(matchFirst("a+(b*c-d*(e+45)+(aaaa-(d))+(c*2+1)", x));
}
--
Dmitry Olshansky