Personally, I think people need to know when not to use a regex and when to switch to a proper grammar/parser. I've been in that boat myself and have made that mistake more than a few times.
When the regex itself is more complicated and harder to understand than the equivalent grammar, the value of continuing to use it is gone. At that point the regex is unlikely to even outperform a grammar-based solution. Forcing a regex on every problem because it's familiar is pretty irrational, but it's probably a consequence of the vast majority of regex users not having much experience with proper parsers. At the end of the day, we're all writing programmatic scrolls for little state machines, encoded in whatever language was chosen (regex, grammar, etc.), but the language chosen to write them should be sane. Why are regexes chosen so often? Familiarity and a false sense of programmer efficiency. Any non-trivial regex eventually turns into a ridiculously large pattern with multiple alternatives that wouldn't even be readable without an /x flag. People keep adding more to those patterns, all the while sinking costs into something where they should just stop the madness and switch to a classic grammar/parser approach.

From a technical perspective, grammars should be considered when multiple alternative, but valid, patterns show up that require stateful logic. The splitting of rules vs tokens as a generalized parsing approach is also quite clean from an abstraction POV. We have rules: they define the way something should look and the order of the elements that fit into them. We have tokens: they define what something actually is, as read from a sequence of bits/bytes. In a sense, such languages separate "code" from "data" and will always win from a maintainability standpoint, because the approach is inherently structured, organized, and has less baked-in data. It also helps that in most of the non-trivial cases they're usually faster too.
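To make the rules-vs-tokens split concrete, here's a minimal sketch (my own illustration, not code from this thread, and in Python rather than Perl for brevity). It handles the same kind of input as Steven's benchmark, sums like "1 + 2", first as a single regex and then as a two-layer tokenizer/parser where the token definitions and the rule logic live in separate places:

```python
import re

# Single-regex approach: works, but every new construct
# gets bolted onto this one pattern.
SUM_RE = re.compile(r"\d+(?:\s*\+\s*\d+)*")

# Grammar-style approach, layer 1 -- tokens: what something IS.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUM>\d+)|(?P<PLUS>\+))")

def tokenize(text):
    """Turn raw characters into (kind, value) pairs."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {text[pos]!r}")
        pos = m.end()
        yield (m.lastgroup, m.group(m.lastgroup))

# Layer 2 -- rules: the ORDER things may appear in.
def parse_sum(text):
    """Rule: sum ::= NUM ('+' NUM)*  -- returns the evaluated total."""
    tokens = list(tokenize(text))
    if not tokens or tokens[0][0] != "NUM":
        raise SyntaxError("expected a number")
    total = int(tokens[0][1])
    i = 1
    while i < len(tokens):
        if (tokens[i][0] != "PLUS" or i + 1 >= len(tokens)
                or tokens[i + 1][0] != "NUM"):
            raise SyntaxError("expected '+ NUM'")
        total += int(tokens[i + 1][1])
        i += 2
    return total
```

The point isn't that the second version is shorter (it isn't); it's that adding, say, subtraction means one new token and one tweak to one rule, instead of surgery on an ever-growing pattern.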
On Jun 9, 2014, at 10:50 PT, Jeffrey Kegler <[email protected]> wrote:

> By the way, another target of opportunity is a regex engine which detects
> "hard" and "easy" regexes. Most regexes it would handle in the ordinary way,
> with a regex engine, but the hard ones it hands over to Marpa. This might
> prove popular because people *want* to do everything with a regex. This
> would allow them to. It'd make a great Perl extension.
>
> -- jeffrey
>
> On 06/09/2014 10:03 AM, Steven Haryanto wrote:
>> Thanks for the answer and explanation. I see that the second approach is
>> about 50% faster on my PC. Although speed-wise it's not on par with regex
>> for this simple case[*], it's interesting nevertheless and will be useful in
>> certain cases.
>>
>> *) Did a simple benchmark for string: ("a" x 1000) . " 1+2 " . ("a" x 1000).
>> With regex search: while ($input =~ /(\d+(\s*\+\s*\d+)*)/g) { ... } I get
>> around 250k searches/sec. With the Marpa grammars I get +- 200/sec and +-
>> 300/sec.
>>
>> Regards,
>> Steven

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
