Re: [ragel-users] Breaking out of a scanner

Adrian Thurston Thu, 25 Feb 2010 07:19:36 -0800

So this scanner will backtrack quite a bit. Every word that turns out not to be an email address will be processed starting from every character. To eliminate that, you can add a pattern just before the any pattern that consumes email_chars+. As long as it doesn't contain '@', it will just replace the default for something that looks almost like an email but isn't quite.
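
A rough, untested sketch of that change against the grammar quoted below. The RESET() body on the new pattern is an assumption (it simply mirrors what the any pattern already does), and the existing patterns are elided rather than repeated:

```ragel
main := |*
    # ... the email, tag, and comment patterns from the quoted grammar, unchanged ...

    # Added: take a whole run of email_chars as a single token. When the
    # email pattern fails partway through a word, the scanner falls back
    # to this match (the end of the run) instead of to the one-character
    # any match, so the rest of the word is not rescanned.
    email_chars+ => { RESET(); };

    # Existing default: any single character not handled above.
    any => { RESET(); };
*|;
```

Since email_chars does not include '@', the new pattern can never be longer than a complete email match, so real addresses should still be picked up by the first pattern.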


-Adrian


Matthieu Tourne wrote:

So I'm writing a parser that recognizes email addresses in an HTML document, in order to obfuscate them.
This is a slightly simplified version of my current grammar:

main := |*
((email_chars+) >email_start ('.' email_chars+)* '@' @email_confirmed (domain_part '.')+ domain_part) $email_max => email_end;

# turn off email scanning until the end of the tag
'<' html_tag => { RESET(); fgoto tag; };

# turn off email scanning until the end of the comment
'<!--' => { RESET(); fgoto comment; };

any => { RESET(); };

*|;

email_chars = [a-zA-Z0-9#&+~_\-];
domain_part = [a-zA-Z0-9] ([a-zA-Z0-9\-]* [a-zA-Z0-9])?;

RESET() is a macro that resets some internal tracking variables (set by actions such as email_start, email_confirmed, etc.).
html_tag is the list of all possible HTML tags.

When I transform this into the pure state machine I described earlier (all expressions unioned and wrapped with a Kleene star), some emails don't match anymore, and I get parse errors.
It works currently, but I think that if I could eliminate the need for backtracking, performance could really improve.

Thanks,

Matthieu.

On Tue, Feb 23, 2010 at 6:24 AM, Adrian Thurston <[email protected]> wrote:
    Matthieu Tourne wrote:

        I've tried that without much success, I have a union with all my
        scanner patterns, wrapped in ()**.
        I have also replaced all the => { do_stuff(); } in the scanner
        with @{ do_stuff(); } for each pattern.


    If you want, post the specifics and we might be able to nail down
    the problem.


        So I'm back to using a scanner construction, but resetting the
        backtracking between buffers, and it seems to work fine.
        Are there any concerns with doing something like that?


    No.

    -Adrian


    _______________________________________________
    ragel-users mailing list
    [email protected] <mailto:[email protected]>
    http://www.complang.org/mailman/listinfo/ragel-users




--
Matthieu Tourne


