Hi Ivan, Thanks a lot for your explanation. It helped me find the key I'd been missing: "The SPARQL grammar is LL(1) when the rules with uppercased names are used as terminals." So the uppercased names *do* need some kind of backtracking.
My example with the parentheses was a bit silly (indeed I'd misread the grammar slightly), but I was trying to illustrate the basic problem which still holds: PN_PREFIX and PN_LOCAL can't end with ".": PN_PREFIX ::= PN_CHARS_BASE ((PN_CHARS|'.')* PN_CHARS)? PN_LOCAL ::= (PN_CHARS_U | ':' | [0-9] | PLX ) ((PN_CHARS | '.' | ':' | PLX)* (PN_CHARS | ':' | PLX) )? (from https://www.w3.org/TR/sparql11-query/#sparqlGrammar) As I mentioned, I was able to write a simple backtracking star to parse these, but it didn't compose well with bar, since bar doesn't send back the error which is needed to trigger the backtracking. The best I've been able to get working is a "sandbox" combinator (code below) that limits the scope of the backtracking, so I don't have to modify anything else in abnf/lexgen or my parser. Does it look correct to you? I hope that since these are the only two places in the grammar that require backtracking, and I'm able to isolate it, that the performance hit won't be to big...? Nathaniel (define (bstar p) (lambda (sk fk strm) (let ((try (lambda (s) (let ((ss (sk s))) (if (equal? ss '(error)) #f ss))))) (p (lambda (strm1) (or ((bstar p) try try strm1) (sk strm))) sk strm)))) (define (sandbox p) (lambda (sk fk strm) (let ((s (p values err strm))) (if (equal? s '(error)) (fk strm) (sk s))))) (define PN_PREFIX (concatenation PN_CHARS_BASE (optional-sequence (sandbox (concatenation (bstar (alternatives (char-list/lit ".") PN_CHARS)) PN_CHARS))))) On Fri, Sep 15, 2017 at 5:42 PM, Ivan Raikov <[email protected]> wrote: > Hi Nathaniel, > > No problem, thanks for trying to use lexgen and abnf. The excerpt > from the grammar does not include the rules for PNAME, or PNAME_NS, > but if I understand correctly, you are suggesting that PrefixedName > could include parentheses which could cause the closing parenthesis of > DataBlockValue to be consumed incorrectly. This would indeed require > general backtracking, but it is a bit surprising to me, because > usually ABNF grammars avoid backtracking as it can be very costly in > terms of memory and computation time. For detailed explanation see > e.g. here: http://www.artima.com/pins1ed/combinator-parsing.html#31.10 > > Parser combinator libraries such as lexgen tend to favor LL > grammars and avoid backtracking because the latter requires keeping > multiple copies of the input stream, and traversing back one token at > a time until a rule succeeds, which usually ends up being > prohibitively slow. I suppose a naive backtracking combinator in > lexgen could be implemented using something analogous to the > bind/rebind combinators, but instead the argument combinator would be > applied to the previous copy of the input stream in case of failure. > But I suspect this would be too slow to be of any practical use. > > Are you absolutely certain that the parentheses in PrefixedName can > occur as something other than in escaped characters? Again, requiring > left recursion and general backtracking is very unusual for an ABNF > grammar, so better to make sure you understand the various PN_ rules. > I am happy to help with that, just let me know. > > -Ivan > > > On Fri, Sep 15, 2017 at 1:45 AM, Nathaniel Rudavsky-Brody > <[email protected]> wrote: > > Hi Ivan, > > > > Thanks for replying, and sorry for being unclear. By different levels I > just > > mean that I need a non-greedy * for a low-level rule that gets combined > in > > different ways higher up. I'm able to write a correct parser by > "flattening > > out" the rules, but since it's a big grammar (173 rules) that quickly > gets > > messy. > > > > Here's an excerpt from the real rules. Practically speaking, the problem > is > > that by itself, "ex:abc)" is a valid PrefixedName, but in the expression > > "(?var) { (ex:abc) }" only "ex:abc" should be parsed as a PrefixedName, > and > > the ")" as part of the InlineDataFull. > > > > > > PN_LOCAL_ESC ::= '\' ( '_' | '~' | '.' | '-' | '!' | '$' | '&' | "'" | > '(' > > | ')' | '*' | '+' | ',' | ';' | '=' | '/' | '?' | '#' | '@' | '%' ) > > > > PLX ::= PERCENT | PN_LOCAL_ESC > > > > PN_LOCAL ::= (PN_CHARS_U | ':' | [0-9] | PLX ) ((PN_CHARS | '.' | ':' | > > PLX)* (PN_CHARS | ':' | PLX) )? > > > > PNAME_LN ::= PNAME > > > > PrefixedName ::= PNAME_LN | PNAME_NS > > > > iri ::= IRIREF | PrefixedName > > > > DataBlockValue ::= iri | RDFLiteral | NumericLiteral | BooleanLiteral | > > 'UNDEF' > > > > InlineDataFull ::= ( NIL | '(' Var* ')' ) '{' ( '(' DataBlockValue* > ')' | > > NIL )* '}' > > > > > > > > With my limited understanding of lexgen's internals, I can write a * that > > backtracks if it catches an error from a later success continuation, but > > this doesn't compose with something like opt that doesn't signal an > error: > > > > (define (bstar p) > > (lambda (sk fk strm) > > (let ((try > > (lambda (s) > > (let ((ss (sk s))) > > (if (equal? ss '(error)) #f > > ss))))) > > (p (lambda (strm1) > > (or > > ((bstar p) try sk strm1) > > (sk strm))) > > sk > > strm)))) > > > > > > (lex (seq (bstar (lit "a")) (lit "a")) err "aaaaab")) ; => ((a a a a a) > (b)) > > > > but > > > > (lex (opt (seq (bstar (lit a)) (lit a))) err "aaaaab") ; => (() (a a a > a a > > b)) > > > > > > Does this make more sense? > > > > Nathaniel > > > > > > On Thu, Sep 14, 2017 at 9:54 PM, Ivan Raikov <[email protected]> > > wrote: > >> > >> Hi Nathaniel, > >> > >> I don't understand what you mean by "different levels" here. Can > >> you express your grammar in ABNF notation? > >> > >> -Ivan > >> > >> > >> On Thu, Sep 14, 2017 at 4:06 AM, Nathaniel Rudavsky-Brody > >> <[email protected]> wrote: > >> > Hi, > >> > > >> > I was wondering if anyone might have some advice for a beginner's > >> > question > >> > with abnf/lexgen. Clearly it has something to do with backtracking, > and > >> > I > >> > hope I'm just missing something simple -- 1500 lines into my parser, > I'd > >> > hate to have to start from scratch with another approach! > >> > > >> > > >> > For matching a string of a's and b's that ends with an "a", clearly > >> > > >> > > >> > (lex (seq (star (bar (lit "a") (lit "b"))) > >> > (lit "a")) > >> > error "ababa") ; => error > >> > > >> > > >> > For patterns on the same level, I can do manual backtracking: > >> > > >> > > >> > (define-syntax vac > >> > (syntax-rules () > >> > ((_ fn) (lambda args (apply fn args))))) > >> > > >> > (define (left-star-seq p1 p2) ; p1*p2 > >> > (vac > >> > (bar > >> > (seq p1 (left-star-seq p1 p2)) > >> > p2))) > >> > > >> > (lex (left-star-seq > >> > (bar (lit "a") (lit "b")) > >> > (lit "a")) > >> > error "ababa") ; => ((#\a #\b #\a #\b #\a) ()) > >> > > >> > > >> > but what if p1 and p2 are on completely different levels, something > >> > like: > >> > > >> > > >> > (define p1 (star (bar (lit "a") (lit "b")))) > >> > (define p2 (lit "a")) > >> > ... > >> > (define q2 (seq q1 p1)) > >> > (define q4 (seq q3 q2)) > >> > (define Word (seq q4 p2)) > >> > > >> > > >> > Is there way to do "composable" backtracking with these tools? (and is > >> > the > >> > question even the right one?) > >> > > >> > Thanks for any ideas, > >> > > >> > Nathaniel > >> > > >> > _______________________________________________ > >> > Chicken-users mailing list > >> > [email protected] > >> > https://lists.nongnu.org/mailman/listinfo/chicken-users > >> > > > > > >
_______________________________________________ Chicken-users mailing list [email protected] https://lists.nongnu.org/mailman/listinfo/chicken-users
