Alan Manuel Gloria:
> *shrug* yes, we should use some tool.
>
> For what it's worth, Haskell's Parsec library combined with Haskell's
> Monad Transformer library allows using the same syntax for both INDENT
> / DEDENT / SAME -guarded parsers, and basic parsers (like n-expr) that
> don't have INDENT/DEDENT/SAME, and allowing the second to be used
> inside the first.
>
> Parsec also defaults to LL(1), meaning 1-item lookahead, which is a
> necessity in actual Scheme implementations, since we expect to require
> only peek-char (it supports limited-length lookahead using the 'try'
> combinator, so if we avoid that, we know it's strictly LL(1)). So if
> we can get it working in Parsec, we can reasonably expect to get it
> working in Scheme implementations without unget-char, only peek-char.
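To make the peek-char-only constraint concrete, here is a minimal sketch in Python (not Scheme and not Parsec). It assumes a hypothetical `PeekStream` class that stands in for a port offering only peek-char and read-char, plus a hypothetical `parse_datum` that reads a toy s-expression. The names and the toy grammar are illustrative assumptions, not part of any proposed spec.

```python
import io


class PeekStream:
    """Hypothetical character source with peek() and read() only,
    standing in for a Scheme port that has peek-char but no unget-char."""

    def __init__(self, text):
        self._src = io.StringIO(text)
        self._next = self._src.read(1)  # the single character of lookahead

    def peek(self):
        return self._next  # "" signals end of input

    def read(self):
        ch = self._next
        self._next = self._src.read(1)
        return ch


def skip_spaces(s):
    while s.peek() != "" and s.peek().isspace():
        s.read()


def parse_datum(s):
    """Parse one toy datum: an atom, or a parenthesized list of data.
    Every decision is made from s.peek() alone, so this is LL(1)."""
    skip_spaces(s)
    ch = s.peek()
    if ch == "":
        raise EOFError("unexpected end of input")
    if ch == ")":
        raise ValueError("unexpected )")
    if ch == "(":
        s.read()
        items = []
        while True:
            skip_spaces(s)
            if s.peek() == "":
                raise EOFError("missing )")
            if s.peek() == ")":
                s.read()
                return items
            items.append(parse_datum(s))
    # Atom: consume until whitespace, a parenthesis, or end of input.
    atom = []
    while s.peek() != "" and not s.peek().isspace() and s.peek() not in "()":
        atom.append(s.read())
    return "".join(atom)
```

Because the parser never consumes a character it has not already committed to, whatever follows the datum is still sitting in the stream for the next reader to pick up.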
I agree that having a parser with significant indent-processing
capabilities would be a big plus.

I didn't include Parsec for several reasons, all based on the fact
that Parsec is totally tied to Haskell:

* I have serious notation concerns. We want to create a spec that
  will be read by others as part of the SRFI. ANTLR's notation is
  really excellent, it looks "just like the books". APG's is nasty.
  When I look at the Parsec example here:
  http://en.wikibooks.org/wiki/Write_Yourself_a_Scheme_in_48_Hours/Parsing
  it's clear that the Parsec notation is basically... Haskell.
  (Well, what a surprise :-) ). I like Parsec's notation better than
  APG's, but Parsec's notation is a REALLY different notation than
  "usual" BNFs and is not at all "what the books use".
* A *lot* of people don't grok Haskell, and it's certainly not my
  strongest language either. I particularly worry that we'll need to
  handle certain cases specially if we want to seriously implement the
  spec with the tool, and at that point I'll end up throwing up my
  hands. I'm confident of my ability to fiddle with ANTLR and its ilk,
  but not with Parsec/Haskell. There are ports of Parsec, but they're
  tied to their languages too, and it's not clear that the ports are
  as widely used/supported.
* It'd also be nice to be able to generate Javascript, so that we
  could have it working directly on the website. Parsec can't do that,
  again since it's tied to Haskell. That's not as important as the
  other issues.

I won't *categorically* rule out Parsec... just say that there were
*reasons* I didn't seriously consider Parsec.

Do you (or anyone else here) have experience with Parsec, ANTLR, or
similar? I've used bison/yacc several times, and I've done recursive
descent by hand, but I've never used an LL-based parsing tool.

> You know, "guarded" parsers are not standard in parsing lore. So we
> may need to hack support for INDENT / DEDENT/ SAME on whatever parser
> generator we use.
> Ideally, we should be able to delete the actions of
> the parser in the parser spec and the parser will still, at the
> minimum, be able to either signal a parse completion, or a parse
> failure. So even if we use a tool, I suspect that ideally, we would
> have a translator on top of this tool (a preprocessor) that provides
> INDENT/DEDENT/SAME or makes some valid transformation.

I think you're right that we'll probably need to handle indentation
specially no matter what. Traditionally parsers have a lexing
preprocessor, and it appears to me that most people just bake
indentation handling into a preprocessor of some kind.
Abbreviation+space is easily handled that way too.

I've done a little reading on ANTLR, which appears to be one of the
major LL-based parser generators around. Several people *have*
implemented indentation processing in it as well, though it's
certainly not a strength of ANTLR.

A problem with any of these tools is that there are some complicating
factors in sweet-expressions that make them easy to use and
understand, but unusual to parse:

* Indentation-sensitivity... but only outside character pairs
* The "\\" and "$" have a different semantic meaning from "{\\}" and
  "{$}". One obvious way to handle this is to remember the first
  character, read in an n-expr, and then compare, but that doesn't sit
  well with traditional LL-tools.
* The "$" and "\\" have slightly different semantics at beginning of
  line vs. middle of line. By itself, trivial, but less so when
  combined with the above.
* Abbreviation + space/eol after any indent has a special meaning
* # <CHAR> can do so many things, e.g., #|...|#, and in some cases
  can set the indent level.
* ; on a line by itself is ignored.

The real challenge is trying not to read any characters unless truly
necessary, so we can reuse the underlying readers, and that drives us
towards LL-style and recursive descent parsers.

--- David A. Wheeler

_______________________________________________
Readable-discuss mailing list
Readable-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/readable-discuss
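For illustration, the "bake indentation handling into a preprocessor" idea from the message above might look roughly like the Python sketch below. The function name `indent_tokens` and the token names are hypothetical. The sketch makes several simplifying assumptions: spaces only, blank lines skipped, and every (, [ or { counted as an opener even inside strings, comments, or character literals. It covers only the first complication listed (indentation matters only outside character pairs) and none of the others.

```python
OPENERS, CLOSERS = "([{", ")]}"


def indent_tokens(text):
    """Yield (kind, payload) pairs, where kind is INDENT, DEDENT, SAME or LINE.

    Indentation is compared only when no ( [ { pair is open, so lines
    inside a pair pass through as plain LINE tokens.
    """
    stack = [0]      # indentation widths of the enclosing blocks
    depth = 0        # number of currently open ( [ { pairs
    started = False  # no SAME is emitted before the very first line
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue  # assumption: blank lines are simply skipped here
        if depth == 0:
            width = len(line) - len(line.lstrip(" "))
            if width > stack[-1]:
                stack.append(width)
                yield ("INDENT", None)
            elif width < stack[-1]:
                while width < stack[-1]:
                    stack.pop()
                    yield ("DEDENT", None)
                if width != stack[-1]:
                    raise ValueError("inconsistent dedent: " + repr(raw))
            elif started:
                yield ("SAME", None)
            started = True
        yield ("LINE", line.strip())
        # Track pair nesting so later lines know whether indentation counts.
        for ch in line:
            if ch in OPENERS:
                depth += 1
            elif ch in CLOSERS and depth > 0:
                depth -= 1
    while len(stack) > 1:  # close any blocks still open at end of input
        stack.pop()
        yield ("DEDENT", None)
```

A grammar sitting on top of such a token stream could then treat INDENT, DEDENT and SAME as ordinary terminals, which is roughly what the published ANTLR indentation workarounds do in the lexer.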