Thank you for the advice.
Since then I've rewritten a good chunk of the grammar so I
wouldn't have to deal with any unintended backtracking.
It now looks like this:
email = ((email_chars)+ >email_start ('.' email_chars+ )* '@'
@email_confirmed
domain_name tld) $email_max %email_end %err(email_end);
main := (
email
| '<' >tag_start html_tag
)** $err(main);
The main error action just resets some internal variables and advances p.
But I run into other problems: for example, if a tag is split across two
different buffers, it won't be correctly identified.
For instance, if </body> is split across two buffers, the machine enters
tag_start on the '<', but when the next buffer comes in it tries to match
'dy>' as an email.
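
To make the split-tag case concrete: a machine can only resume in the middle of a tag if its current state survives from one buffer to the next. Below is a minimal hand-written sketch in C, not Ragel output. The names tag_matcher, matcher_feed and count_split, and the fixed "</body>" pattern, are hypothetical and chosen only for illustration. The cs field plays the role that Ragel's cs variable plays, and resetting it between buffers is assumed here to be one possible cause of the behaviour described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed pattern, used only for this illustration. */
static const char PATTERN[] = "</body>";

/* Hypothetical resumable matcher: the state lives outside the feed call. */
typedef struct {
    size_t cs;    /* current state: number of pattern characters matched */
    int    found; /* number of complete matches seen so far */
} tag_matcher;

/* Run once per document, not once per buffer. */
static void matcher_init(tag_matcher *m)
{
    assert(m != NULL);
    m->cs = 0;
    m->found = 0;
}

/* Consume one buffer; m->cs carries the partial match to the next call. */
static void matcher_feed(tag_matcher *m, const char *buf, size_t len)
{
    const size_t plen = sizeof PATTERN - 1;

    assert(m != NULL && buf != NULL);
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        if (c == PATTERN[m->cs]) {
            if (++m->cs == plen) {
                m->found++;
                m->cs = 0;
            }
        } else {
            /* '<' occurs only at the start of the pattern, so a mismatch
               restarts at state 1 on '<' and at state 0 otherwise. */
            m->cs = (c == PATTERN[0]) ? 1 : 0;
        }
    }
}

/* Feed two buffers, optionally discarding the state in between. */
static int count_split(const char *first, const char *second, int reset_between)
{
    tag_matcher m;

    matcher_init(&m);
    matcher_feed(&m, first, strlen(first));
    if (reset_between)
        m.cs = 0; /* simulates re-initialising the machine per buffer */
    matcher_feed(&m, second, strlen(second));
    return m.found;
}
```

With the state kept, count_split("<html></bo", "dy>", 0) finds the tag; with reset_between set to 1, the second buffer starts from scratch and 'dy>' is seen as fresh input. In Ragel terms this corresponds to running the init code once, then updating only p and pe before each exec while leaving cs alone. A scanner additionally needs ts, te and act preserved, plus the unconsumed start of the current token carried over into the next buffer.
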
Maybe this would work better if I explicitly jumped to other entry points,
rather than staying in main?
Or is a scanner usually better suited to what I'm trying to achieve?
Thanks,
Matthieu.
On Thu, Feb 25, 2010 at 7:19 AM, Adrian Thurston wrote:
> So this scanner will backtrack quite a bit. Every word that turns out to
> not be an email address will be processed starting from every character.
> To eliminate that you can add a pattern just before any that consumes
> email_chars+. As long as it doesn't contain '@' it will just replace the
> default for something that looks almost like an email, but isn't quite.
>
> -Adrian
>
> Matthieu Tourne wrote:
>
>> So I'm doing a parser that recognizes email addresses in an html document,
>> in order to obfuscate them.
>>
>> This is a slightly simplified version of my current grammar :
>>
>> main := |*
>> ((email_chars+) >email_start ('.' email_chars+ )* '@' @email_confirmed
>> (domain_part '.')+ domain_part) $email_max => email_end;
>>
>> # turn off email scanning until the end of the tag
>> '<' html_tag => { RESET(); fgoto tag; };
>>
>> # turn off email scanning until the end of the comment
>> '<!--' => { RESET(); fgoto comment; };
>>
>> any => { RESET(); };
>>
>> *|;
>>
>> email_chars = [a-zA-Z0-9#&+~_\-];
>> domain_part = [a-zA-Z0-9] ([a-zA-Z0-9\-]* [a-zA-Z0-9])?;
>>
>> RESET(); is a macro to reset some internal tracking variables (set by
>> actions such as email_start, email_confirmed, etc...).
>> html_tag is the list of all possible html tags.
>>
>> When I transform this into the pure state machine I described earlier (all
>> expressions unioned and wrapped in a Kleene star),
>> some emails no longer match, and I get parse errors.
>> The scanner works currently, but I think that if I could eliminate the
>> need for backtracking, performance could really improve.
>>
>> Thanks,
>>
>> Matthieu.
>>
>>
>> On Tue, Feb 23, 2010 at 6:24 AM, Adrian Thurston wrote:
>>
>>
>> Matthieu Tourne wrote:
>>
>> I've tried that without much success, I have a union with all my
>> scanner patterns, wrapped in ()**.
>> I have also replaced all the => { do_stuff(); } in the scanner
>> with @{ do_stuff(); } for each pattern.
>>
>>
>> If you want, post the specifics and we might be able to nail down
>> the problem.
>>
>>
>> So, I'm back to using a scanner construction, but resetting the
>> backtracking in between buffers, and it seems to work fine.
>> Are there any concerns with doing something like that ?
>>
>>
>> No.
>>
>> -Adrian
>>
>>
>> _______________________________________________
>> ragel-users mailing list
>> [email protected] <mailto:[email protected]>
>>
>> http://www.complang.org/mailman/listinfo/ragel-users
>>
>>
>>
>>
>> --
>> Matthieu Tourne
>>
>>
--
Matthieu Tourne