I'm writing a parser that recognizes email addresses in an HTML document
in order to obfuscate them.
This is a slightly simplified version of my current grammar:
email_chars = [a-zA-Z0-9#&+~_\-];
domain_part = [a-zA-Z0-9] ([a-zA-Z0-9\-]* [a-zA-Z0-9])?;

main := |*
    ((email_chars+) >email_start ('.' email_chars+)* '@' @email_confirmed
     (domain_part '.')+ domain_part) $email_max => email_end;
    # turn off email scanning until the end of the tag
    '<' html_tag => { RESET(); fgoto tag; };
    # turn off email scanning until the end of the comment
    '<!--' => { RESET(); fgoto comment; };
    any => { RESET(); };
*|;
RESET() is a macro that resets some internal tracking variables (set by
actions such as email_start, email_confirmed, etc.).
html_tag is the list of all possible HTML tags.
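To check which addresses the email pattern is meant to accept, independently
of Ragel, it can be approximated with a Python regular expression. This is only
a sketch: EMAIL_RE and find_emails are hypothetical names, Python's re module is
a backtracking engine rather than a Ragel state machine, and the actions
(email_start, email_confirmed, email_max, email_end) are not modeled.

```python
import re

# Same character classes as email_chars and domain_part in the grammar.
EMAIL_CHARS = r"[a-zA-Z0-9#&+~_\-]"
DOMAIN_PART = r"[a-zA-Z0-9](?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?"

# email_chars+ ('.' email_chars+)* '@' (domain_part '.')+ domain_part
EMAIL_RE = re.compile(
    rf"{EMAIL_CHARS}+(?:\.{EMAIL_CHARS}+)*@(?:{DOMAIN_PART}\.)+{DOMAIN_PART}"
)


def find_emails(text):
    """Return every substring of text that matches the email pattern."""
    return [m.group(0) for m in EMAIL_RE.finditer(text)]


if __name__ == "__main__":
    print(find_emails("contact john.doe@example.com today"))
```

Comparing its output with what the generated machine reports on the same input
might help narrow down which addresses stop matching.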
When I transform this into the pure state machine I described earlier (all
expressions unioned and wrapped in a Kleene star),
some emails no longer match, and I get parse errors.
The scanner version currently works, but I think that if I could remove the
need for backtracking, performance could improve considerably.
Thanks,
Matthieu.
On Tue, Feb 23, 2010 at 6:24 AM, Adrian Thurston <
[email protected]> wrote:
>
> Matthieu Tourne wrote:
>
>> I've tried that without much success, I have a union with all my scanner
>> patterns, wrapped in ()**.
>> I have also replaced all the => { do_stuff(); } in the scanner with @{
>> do_stuff(); } for each pattern.
>>
>
> If you want, post the specifics and we might be able to nail down the
> problem.
>
>
>> So, I'm back to using a scanner construction, but resetting the
>> backtracking in between buffers, and it seems to work fine.
>> Are there any concerns with doing something like that ?
>>
>
> No.
>
> -Adrian
>
>
> _______________________________________________
> ragel-users mailing list
> [email protected]
> http://www.complang.org/mailman/listinfo/ragel-users
>
--
Matthieu Tourne