Terence Parr <parrt <at> cs.usfca.edu> writes:

> On Jun 18, 2009, at 6:20 PM, Michaeljohn Clement wrote:
> > I think generating acceptable error messages from a parser alone is an
> > interesting hard problem. It might be possible to do some statistical
> > analysis on a corpus of valid inputs and then derive heuristics to
> > suggest what the most likely error in the input string might be.
>
> Has anyone thought of parsing backwards from the end towards the
> detected error location? If you parse forwards and backwards you might
> be able to zoom in on a problem area. Of course if there are lots of
> errors following the first one, it won't help you too much. It's sort
> of what a human does though, isn't it? We look down a few tokens and
> work our way back up to see if we can make sense of things.
>
> The other thing I wondered about. Can we launch a whole bunch of
> threads using multiple core to sniff the input to improve error
> analysis? Maybe we launch parsers at multiple points in the input
> stream and then use the interpretation that yields the fewest errors.
>
> Just random thoughts. Let's use those cores, man! Right now, all they
> do is run Pandora and instant messaging for me. ;)
>
> Ter
Parsing backwards is interesting; however, as you've mentioned, it runs
into trouble when the document contains more than one parse error.

Another idea worth thinking about is phrase-level, context-sensitive
errors. Context sensitivity is achieved by matching some number of parser
frames (each being the application of a production rule) on the top of
the stack against a given pattern. If a phrase fails to match, the parser
checks some sequence(s) of production rules against the top frames on
the stack and raises an error before it backtracks to apply the next
phrase in the production rule or cascades a failure.

Another idea, previously mentioned, is a production-rule-level parse
error: if a production rule fails to match any of its phrases, it simply
causes a parse error. This is very appealing, especially for production
rules with only one phrase, where an error is only detectable on failure
of the production rule. The following is an example of this:

    Type
        : 'int'
        : 'float'
        : 'char'
        ;

    Identifier
        : !Type ...
        ;

    IdentifierDeclaration
        : Type Identifier IdentifierList <semicolon>
        ;

    IdentifierList
        : <comma> Identifier IdentifierList
        : <>
        ;

For the input "int foo, bar float;" the parser proceeds without trouble
until it applies IdentifierList(", bar float;"). That application matches
the comma and the identifier and *also* the nested IdentifierList (by
failing on 'float' and backtracking to match <> (epsilon)).
IdentifierDeclaration then fails when it cannot match the semicolon.

This is an annoying case where a parse error is somewhat obscured by a
successful application of a production rule. Really, the parse error
occurs in IdentifierList, but it only appears when we fail
IdentifierDeclaration and subsequently cascade off of the parser stack.

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
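
To make the IdentifierList example above concrete, here is a minimal Python sketch of a hand-written backtracking parser for that grammar. It is only an illustration under some assumptions: the tokenizer, the `trace` list, and all function and class names are hypothetical, and because the Identifier rule is elided ("...") in the grammar, the sketch assumes an identifier is any word token that is not a Type. The point is just to show the epsilon alternative succeeding right where the real mistake is, so that the failure is only reported one rule further up.

```python
import re

TYPES = {"int", "float", "char"}

def tokenize(text):
    # Assumption: tokens are words, commas, and semicolons; whitespace is skipped.
    return re.findall(r"[A-Za-z_]\w*|[,;]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.trace = []  # (rule name, token position, outcome) records

    def peek(self, pos):
        return self.tokens[pos] if pos < len(self.tokens) else None

    def literal(self, pos, text):
        # Match one terminal; return the next position, or None on failure.
        return pos + 1 if self.peek(pos) == text else None

    def type_(self, pos):
        return pos + 1 if self.peek(pos) in TYPES else None

    def identifier(self, pos):
        # Identifier : !Type ...  (the rest of the rule is assumed to be a word)
        tok = self.peek(pos)
        if tok is None or tok in TYPES:
            return None
        return pos + 1 if re.fullmatch(r"[A-Za-z_]\w*", tok) else None

    def identifier_list(self, pos):
        # First phrase: <comma> Identifier IdentifierList
        p = self.literal(pos, ",")
        if p is not None:
            p = self.identifier(p)
        if p is not None:
            p = self.identifier_list(p)
        if p is not None:
            self.trace.append(("IdentifierList", pos, "comma phrase"))
            return p
        # Second phrase: <> (epsilon), which always succeeds without consuming.
        self.trace.append(("IdentifierList", pos, "epsilon"))
        return pos

    def identifier_declaration(self, pos):
        # IdentifierDeclaration : Type Identifier IdentifierList <semicolon>
        p = self.type_(pos)
        if p is not None:
            p = self.identifier(p)
        if p is not None:
            p = self.identifier_list(p)
        if p is None:
            self.trace.append(("IdentifierDeclaration", pos, "failed"))
            return None
        end = self.literal(p, ";")
        if end is None:
            # The only place a failure is ever reported for the sample input.
            self.trace.append(("IdentifierDeclaration", p, "expected ';'"))
        return end

def parse_declaration(text):
    parser = Parser(tokenize(text))
    end = parser.identifier_declaration(0)
    return end is not None, parser.trace
```

Calling `parse_declaration("int foo, bar float;")` returns a failed parse whose trace records IdentifierList matching epsilon at the 'float' token, followed by IdentifierDeclaration complaining about a missing semicolon at that same token. A production-rule-level or phrase-level error scheme would need to hook in at the epsilon fallback rather than at the semicolon check.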