Re: [ragel-users] Breaking out of a scanner

Matthieu Tourne Mon, 08 Mar 2010 15:56:31 -0800

Thank you for the advice.

Since then I've rewritten a good chunk of the grammar so that I no longer
have to deal with any unintended backtracking.

Now it looks like this:

email = ((email_chars)+ >email_start ('.' email_chars+ )* '@' @email_confirmed
     domain_name tld) $email_max %email_end %err(email_end);

main := (
     email

     | '<' >tag_start html_tag

     )** $err(main);

The action main just resets some internal variables and advances p.

But I run into other problems. For example, if a tag is split across two
buffers, it isn't identified correctly.
For instance, if </body> is split across two buffers, the machine goes into
tag_start on the '<', but when the next buffer comes in it tries to match
'dy>' as an email.
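
A minimal C sketch of the buffer-boundary issue described above. This is
hand-written, not Ragel output, and every name in it (struct scanner,
scanner_init, scanner_feed, MODE_TEXT, MODE_TAG) is hypothetical. It assumes
a much simpler machine than the real grammar: one mode for text and one for
"inside a tag". The point it illustrates is that the mode has to live outside
the per-buffer call, in the same way that Ragel's cs variable (plus ts, te
and act for a scanner) has to persist between invocations.

```c
#include <stddef.h>

/* Hypothetical two-mode state; it plays the role of Ragel's `cs`. */
enum mode { MODE_TEXT, MODE_TAG };

struct scanner {
    enum mode mode;   /* must survive from one buffer to the next */
    char text[64];    /* characters seen outside of tags */
    size_t text_len;
};

/* Call once per document, not once per buffer. */
static void scanner_init(struct scanner *s)
{
    s->mode = MODE_TEXT;
    s->text_len = 0;
    s->text[0] = '\0';
}

/* Process one buffer. The mode is read from *s and written back to it,
 * so a tag opened in one buffer can be closed in the next. */
static void scanner_feed(struct scanner *s, const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if (s->mode == MODE_TAG) {
            if (c == '>')
                s->mode = MODE_TEXT;   /* end of tag */
        } else if (c == '<') {
            s->mode = MODE_TAG;        /* start of tag */
        } else if (s->text_len + 1 < sizeof s->text) {
            s->text[s->text_len++] = c;
            s->text[s->text_len] = '\0';
        }
    }
}
```

Feeding "a</bo" and then "dy>b" through the same struct leaves only "ab" in
text. If the mode is reset to MODE_TEXT between the two calls, "dy>" leaks
into the text, which is the symptom described above.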

Maybe this would work better if I explicitly jumped to other entry points,
rather than staying in main?
Or is a scanner usually better suited to what I'm trying to achieve?

Thanks,

Matthieu.

On Thu, Feb 25, 2010 at 7:19 AM, Adrian Thurston <
[email protected]> wrote:

> So this scanner will backtrack quite a bit. Every word that turns out not
> to be an email address will be processed starting from every character.
> To eliminate that, you can add a pattern, just before 'any', that consumes
> email_chars+. As long as it doesn't contain '@', it will just replace the
> default for something that looks almost like an email but isn't quite.
>
> -Adrian
>
> Matthieu Tourne wrote:
>
>> So I'm writing a parser that recognizes email addresses in an HTML
>> document, in order to obfuscate them.
>>
>> This is a slightly simplified version of my current grammar:
>>
>> main := |*
>> ((email_chars+) >email_start ('.' email_chars+ )* '@' @email_confirmed
>> (domain_part '.')+  domain_part) $email_max => email_end;
>>
>> # turn off email scanning until the end of the tag
>> '<' html_tag => { RESET(); fgoto tag; };
>>
>> # turn off email scanning until the end of the comment
>> '<!--'  => { RESET(); fgoto comment; };
>>
>> any => { RESET(); };
>>
>> *|;
>>
>> email_chars = [a-zA-Z0-9#&+~_\-];
>> domain_part = [a-zA-Z0-9] ([a-zA-Z0-9\-]* [a-zA-Z0-9])?;
>>
>> RESET() is a macro that resets some internal tracking variables (set by
>> actions such as email_start, email_confirmed, etc.).
>> html_tag is the list of all possible HTML tags.
>>
>> When I transform this into the pure state machine I described earlier (all
>> expressions unioned and wrapped with a Kleene star),
>> some emails no longer match, and I get parse errors.
>> The scanner version works currently, but I think that if I could remove the
>> need for backtracking, performance could really improve.
>>
>> Thanks,
>>
>> Matthieu.
>>
>>
>> On Tue, Feb 23, 2010 at 6:24 AM, Adrian Thurston <
>> [email protected] <mailto:[email protected]>>
>> wrote:
>>
>>
>>    Matthieu Tourne wrote:
>>
>>        I've tried that without much success. I have a union of all my
>>        scanner patterns, wrapped in ()**.
>>        I have also replaced every => { do_stuff(); } in the scanner
>>        with @{ do_stuff(); } for each pattern.
>>
>>
>>    If you want, post the specifics and we might be able to nail down
>>    the problem.
>>
>>
>>        So, I'm back to using a scanner construction, but resetting the
>>        backtracking in between buffers, and it seems to work fine.
>>        Are there any concerns with doing something like that?
>>
>>
>>    No.
>>
>>    -Adrian
>>
>>
>>    _______________________________________________
>>    ragel-users mailing list
>>    [email protected] <mailto:[email protected]>
>>
>>    http://www.complang.org/mailman/listinfo/ragel-users
>>
>>
>>
>>
>> --
>> Matthieu Tourne
>>
>>
>>
>
>



-- 
Matthieu Tourne
