Re: Scanless lexer doesn't try shorter lexem

Peter Stuifzand Mon, 06 Jan 2014 16:12:12 -0800

I didn't know it was off list. It seems I need to learn how to use my new phone.

LTM is pretty useful when parsing programming languages, but in my experience it is hard to work with in data-like or ad hoc formats.

Peter

On Jan 7, 2014 12:56 AM, Ruslan Zakirov <[email protected]> wrote:

Hi,

Peter mentioned longest-tokens-match off list an hour ago and I only noticed it 5 minutes ago. This is not what I was expecting from a scannerless interface.

This means Repa is still a valid thing. I should kill all attempts at continuous parsing in it and release it.

Pauses and manual lexing are not "sexy" :)

What is IRIF? Is it a new Marpa front end with inline actions?

On Tue, Jan 7, 2014 at 3:35 AM, Jeffrey Kegler <[email protected]> wrote:

First off, welcome back. Since you've been away for a while, allow me to let new readers know that you're the founder of this mailing list, and someone whose support and advice have been very valuable to Marpa.

Second, I'm about to dive into the answer, but I'm very open to ideas that would make Marpa easier to use.

Marpa does an old-fashioned longest-tokens-match. It does have information about which tokens are expected, but it imitates traditional parsers in not using that. At the beginning, it looks for the longest token. If there is only one, and it is not acceptable to the grammar, the parse fails. In this case it finds a <value>, and because a <value> is not acceptable first thing, the parse fails.

Longest-tokens-match requires that you contrive it so that the longest token, including those which the grammar will not find acceptable, is always the one you want. Could Marpa do it differently? Yes, and it will in the future. (Aside @amon: perhaps the IRIF already does better?)
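To make the failure mode concrete, here is a minimal Python sketch of longest-tokens-match. It is not Marpa code and does not use the Marpa API: the lexeme names A_D_D and text come from this thread, but the regular expressions and the functions longest_tokens and lex_step are assumptions made up for illustration.

```python
import re

# Hypothetical lexeme patterns, loosely modelled on the vCard grammar
# discussed in this thread. They are assumptions, not the real rules.
LEXEMES = {
    "A_D_D": re.compile(r"[A-Za-z0-9-]+"),  # alphanumerics and dashes
    "text": re.compile(r"[^\r\n]+"),        # anything up to end of line
}

def longest_tokens(source, pos):
    """Return (length, names) of the longest match(es) at pos.

    The grammar is not consulted here: every lexeme competes.
    """
    lengths = {}
    for name, pattern in LEXEMES.items():
        m = pattern.match(source, pos)
        if m:
            lengths[name] = m.end() - pos
    if not lengths:
        return 0, []
    best = max(lengths.values())
    return best, sorted(n for n, length in lengths.items() if length == best)

def lex_step(source, pos, expected):
    """Longest-tokens-match: pick the longest tokens, then ask the grammar."""
    length, names = longest_tokens(source, pos)
    accepted = [n for n in names if n in expected]
    if not accepted:
        # Shorter candidates are never retried.
        raise ValueError(f"No lexemes accepted at {pos}; rejected {names}")
    return accepted, length

line = "UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"
print(longest_tokens(line, 0))  # the 49-character text match wins over "UID"
```

With only A_D_D expected, lex_step(line, 0, {"A_D_D"}) raises an error in this sketch, because the 49-character text match shadows the 3-character A_D_D match. When two lexemes tie on length, as with 'BEGIN:VCARD' in the trace, the expected one is still accepted.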

-- jeffrey

On 01/06/2014 03:08 PM, Ruslan Zakirov wrote:

Hi,

A shorter script that demos the problem: https://gist.github.com/ruz/8291475

Comments below:

On Tue, Jan 7, 2014 at 2:57 AM, Ron Savage <[email protected]> wrote:

I made some small changes:

ron@zigzag:~/Documents/repos/marpa.papers$ diff ~/bin/vcard.parser.orig.pl ~/bin/vcard.parser.pl
0a1,2
> #!/usr/bin/env perl
>
7c9
< my $syntax = <<'END';
---
> my $syntax = <<'EOS';
15,17c17,19
< group ~ A_D_D
< name ~ A_D_D
< params ::= ';' param_list | empty
---
> group ::= A_D_D
> name ::= A_D_D
> params ::= SEMICOLON param_list | empty
21c23
< any_param_name ~ A_D_D
---
> any_param_name ::= A_D_D
86c88
< END
---
> EOS
89c91
< say "rules L0:\n", $grammar->show_rules(1, 'G0');
---
> #say "rules L0:\n", $grammar->show_rules(1, 'G0');

and I get:

ron@zigzag:~/Documents/repos/marpa.papers$ ~/bin/vcard.parser.pl
Setting trace_terminals option
Lexer "L0" rejected lexeme L1c1-11: text; value="BEGIN:VCARD"
Lexer "L0" accepted lexeme L1c1-11: 'BEGIN:VCARD'; value="BEGIN:VCARD"

You can see here that the lexer rejected the text rule, but accepted the literal rule of the same length.

Lexer "L0" accepted lexeme L1c12: CRLF; value="
"
Lexer "L0" rejected lexeme L2c1-11: text; value="VERSION:4.0"
Lexer "L0" accepted lexeme L2c1-11: 'VERSION:4.0'; value="VERSION:4.0"

Once again.

Lexer "L0" accepted lexeme L2c12: CRLF; value="
"
Lexer "L0" rejected lexeme L3c1-49: text; value="UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"

Here the lexer went for the longer match and never tried A_D_D; value="UID".

progress:

P0 @0-0 L1c1 vCards -> . vCard +
P1 @0-0 L1c1 vCard -> . 'BEGIN:VCARD' CRLF 'VERSION:4.0' CRLF content 'END:VCARD'
P36 @0-0 L1c1 :start -> . vCards
R1:1 @0-1 L1c1-11 vCard -> 'BEGIN:VCARD' . CRLF 'VERSION:4.0' CRLF content 'END:VCARD'
R1:2 @0-2 L1c1-12 vCard -> 'BEGIN:VCARD' CRLF . 'VERSION:4.0' CRLF content 'END:VCARD'
R1:3 @0-3 L1c1-L2c11 vCard -> 'BEGIN:VCARD' CRLF 'VERSION:4.0' . CRLF content 'END:VCARD'
R1:4 @0-4 L1c1-L2c12 vCard -> 'BEGIN:VCARD' CRLF 'VERSION:4.0' CRLF . content 'END:VCARD'
P2 @4-4 L2c12 content -> . content_line +
P3 @4-4 L2c12 content_line -> . content_name params ':' value CRLF
P4 @4-4 L2c12 content_name -> . name
P5 @4-4 L2c12 content_name -> . group '.' name
P6 @4-4 L2c12 group -> . A_D_D
P7 @4-4 L2c12 name -> . A_D_D

Error in SLIF parse: No lexemes accepted at line 3, column 1
Lexer "L0" rejected 1 lexeme(s)
Rejected lexeme #1: text; value="UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"; length = 49
* String before error: BEGIN:VCARD\nVERSION:4.0\n
* The error was at line 3, column 1, and at character 0x0055 'U', ...
* here: UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1\n
Marpa::R2 exception at /home/ron/bin/vcard.parser.pl line 96.

So it is trying A_D_D.

Sure. The recognizer waits for A_D_D, but the lexer never offers it.
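For contrast, a lexer driven by what the recognizer expects could look roughly like the Python sketch below. This only illustrates the idea and is not how Marpa::R2 behaves in this thread: the patterns and the function lex_expected are hypothetical, and only the names A_D_D and text are taken from the grammar above.

```python
import re

# Hypothetical lexeme patterns; assumptions for illustration only.
LEXEMES = {
    "A_D_D": re.compile(r"[A-Za-z0-9-]+"),  # alphanumerics and dashes
    "text": re.compile(r"[^\r\n]+"),        # anything up to end of line
}

def lex_expected(source, pos, expected):
    """Try only the lexemes the grammar expects; keep the longest of those."""
    best_len, best_names = 0, []
    for name in sorted(expected):
        m = LEXEMES[name].match(source, pos)
        if not m:
            continue
        length = m.end() - pos
        if length > best_len:
            best_len, best_names = length, [name]
        elif length == best_len:
            best_names.append(name)
    if not best_names:
        raise ValueError(f"No expected lexeme matches at {pos}")
    return best_names, best_len

line = "UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"
print(lex_expected(line, 0, {"A_D_D"}))  # only A_D_D is tried, so "UID" is found
```

Because text is not in the expected set at the start of a content line, it never competes with A_D_D, and the 3-character match is offered to the recognizer.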

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.

--
Best regards, Ruslan.
